---
sidebar_position: 3
---

# Streaming the Output

For more interactive UIs during generation, you can stream output tokens.

:::info
Streaming is supported for both `LLMPipeline` and `VLMPipeline`.
:::

## Streaming Function

In this example, a function outputs words to the console immediately upon generation:

<TabItemPython>
    ```python showLineNumbers
    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline(model_path, "CPU")

    # highlight-start
    # Create a streamer function
    def streamer(subword):
        print(subword, end='', flush=True)
        # Return a status flag indicating whether generation should continue (RUNNING) or stop.
        return ov_genai.StreamingStatus.RUNNING
    # highlight-end

    # highlight-next-line
    pipe.start_chat()
    while True:
        try:
            prompt = input('question:\n')
        except EOFError:
            break
        # highlight-next-line
        pipe.generate(prompt, streamer=streamer, max_new_tokens=100)
        print('\n----------\n')
    # highlight-next-line
    pipe.finish_chat()
    ```
</TabItemPython>
<TabItemCpp>
    ```cpp showLineNumbers
    #include "openvino/genai/llm_pipeline.hpp"
    #include <iostream>

    int main(int argc, char* argv[]) {
        std::string prompt;

        std::string model_path = argv[1];
        ov::genai::LLMPipeline pipe(model_path, "CPU");

        // highlight-start
        // Create a streamer function
        auto streamer = [](std::string word) {
            std::cout << word << std::flush;
            // Return a status flag indicating whether generation should continue (RUNNING) or stop.
            return ov::genai::StreamingStatus::RUNNING;
        };
        // highlight-end

        // highlight-next-line
        pipe.start_chat();
        std::cout << "question:\n";
        while (std::getline(std::cin, prompt)) {
            // highlight-next-line
            pipe.generate(prompt, ov::genai::streamer(streamer), ov::genai::max_new_tokens(100));
            std::cout << "\n----------\n"
                "question:\n";
        }
        // highlight-next-line
        pipe.finish_chat();
    }
    ```
</TabItemCpp>
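
The status returned from the streamer controls whether generation continues. The sketch below is a minimal Python illustration of stopping early once a marker string appears in the streamed text. The marker list and the accumulation logic are illustrative assumptions, and it assumes the `StreamingStatus` enum also exposes a `STOP` member, as in recent OpenVINO GenAI releases:

```python showLineNumbers
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Hypothetical stop markers; replace with whatever makes sense for your prompts.
stop_markers = ["###", "Observation:"]
generated = []

def stopping_streamer(subword):
    print(subword, end='', flush=True)
    generated.append(subword)
    # Check the accumulated text so markers split across subwords are not missed.
    if any(marker in ''.join(generated) for marker in stop_markers):
        # Request early termination of generation.
        return ov_genai.StreamingStatus.STOP
    return ov_genai.StreamingStatus.RUNNING

pipe.generate("Describe the steps of photosynthesis.", streamer=stopping_streamer, max_new_tokens=200)
```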

## Custom Streamer Class

You can also create a custom streamer class for more sophisticated processing:

<TabItemPython>
    ```python showLineNumbers
    import openvino_genai as ov_genai

    pipe = ov_genai.LLMPipeline(model_path, "CPU")

    # highlight-start
    # Create custom streamer class
    class CustomStreamer(ov_genai.StreamerBase):
        def __init__(self):
            super().__init__()
            # Initialization logic.

        def write(self, token_id) -> ov_genai.StreamingStatus:
            # Custom decoding/tokens processing logic.

            # Return a status flag indicating whether generation should continue (RUNNING) or stop.
            return ov_genai.StreamingStatus.RUNNING

        def end(self):
            # Custom finalization logic.
            pass
    # highlight-end

    # highlight-next-line
    pipe.start_chat()
    while True:
        try:
            prompt = input('question:\n')
        except EOFError:
            break
        # highlight-next-line
        pipe.generate(prompt, streamer=CustomStreamer(), max_new_tokens=100)
        print('\n----------\n')
    # highlight-next-line
    pipe.finish_chat()
    ```
</TabItemPython>
<TabItemCpp>
    ```cpp showLineNumbers
    #include "openvino/genai/streamer_base.hpp"
    #include "openvino/genai/llm_pipeline.hpp"
    #include <iostream>

    // highlight-start
    // Create custom streamer class
    class CustomStreamer: public ov::genai::StreamerBase {
    public:
        ov::genai::StreamingStatus write(int64_t token) override {
            // Custom decoding/tokens processing logic.

            // Return a status flag indicating whether generation should continue (RUNNING) or stop.
            return ov::genai::StreamingStatus::RUNNING;
        };

        void end() {
            // Custom finalization logic.
        };
    };
    // highlight-end

    int main(int argc, char* argv[]) {
        std::string prompt;
        // highlight-next-line
        std::shared_ptr<CustomStreamer> custom_streamer = std::make_shared<CustomStreamer>();

        std::string model_path = argv[1];
        ov::genai::LLMPipeline pipe(model_path, "CPU");

        // highlight-next-line
        pipe.start_chat();
        std::cout << "question:\n";
        while (std::getline(std::cin, prompt)) {
            // highlight-next-line
            pipe.generate(prompt, ov::genai::streamer(custom_streamer), ov::genai::max_new_tokens(100));
            std::cout << "\n----------\n"
                "question:\n";
        }
        // highlight-next-line
        pipe.finish_chat();
    }
    ```
</TabItemCpp>
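
As noted in the info box at the top of this page, the same streaming mechanisms work with `VLMPipeline`. The sketch below is a minimal Python illustration; `model_path` and `image_path` are placeholders, the PIL-based image loading follows the visual-language samples, and the exact `generate()` keyword arguments (`image`, `max_new_tokens`) should be treated as assumptions to verify against the VLMPipeline documentation:

```python showLineNumbers
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

def streamer(subword):
    print(subword, end='', flush=True)
    return ov_genai.StreamingStatus.RUNNING

# Load the image as a uint8 HWC tensor, as the visual-language samples do.
image = np.array(Image.open(image_path).convert("RGB"), dtype=np.uint8)
image_tensor = ov.Tensor(image)

pipe = ov_genai.VLMPipeline(model_path, "CPU")
pipe.generate("Describe this image.", image=image_tensor, streamer=streamer, max_new_tokens=100)
```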

:::info
For a fully implemented iterable `CustomStreamer`, refer to the multinomial_causal_lm sample.
:::
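
The sketch below outlines the idea behind such an iterable streamer: `write()` decodes tokens and pushes text chunks into a queue, while the main thread iterates over the streamer as generation runs in a background thread. It is a simplified, hedged illustration of the approach used in that sample, not a drop-in copy; in particular, partial-subword handling is reduced to a bare minimum:

```python showLineNumbers
import queue
import threading

import openvino_genai as ov_genai

class IterableStreamer(ov_genai.StreamerBase):
    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.tokens_cache = []
        self.printed_len = 0
        self.text_queue = queue.Queue()

    def __iter__(self):
        return self

    def __next__(self):
        chunk = self.text_queue.get()  # blocks until a chunk (or the end sentinel) arrives
        if chunk is None:
            raise StopIteration
        return chunk

    def write(self, token_id) -> ov_genai.StreamingStatus:
        self.tokens_cache.append(token_id)
        text = self.tokenizer.decode(self.tokens_cache)
        # Emit only the newly decoded tail; skip chunks ending in an incomplete character.
        if text and not text.endswith('\ufffd'):
            self.text_queue.put(text[self.printed_len:])
            self.printed_len = len(text)
        return ov_genai.StreamingStatus.RUNNING

    def end(self):
        self.text_queue.put(None)  # sentinel: no more chunks

pipe = ov_genai.LLMPipeline(model_path, "CPU")
streamer = IterableStreamer(pipe.get_tokenizer())

# Run generation in a worker thread and consume the streamed chunks here.
worker = threading.Thread(
    target=pipe.generate,
    args=("What is OpenVINO?",),
    kwargs={"streamer": streamer, "max_new_tokens": 100},
)
worker.start()
for chunk in streamer:
    print(chunk, end='', flush=True)
worker.join()
```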