How to serve an LLM using the OpenVINO Model Server on Windows

In this step-by-step guide, we walk through deploying a Large Language Model (LLM) with the OpenVINO Model Server (OVMS). The tutorial focuses on serving the TinyLlama chat model through an OVMS REST API endpoint.

OpenVINO is an open-source toolkit from Intel for optimizing and deploying deep learning models. It improves model performance on Intel hardware, supports frameworks such as PyTorch and TensorFlow, and provides tools for efficient inference across multiple platforms, including CPUs, GPUs, and NPUs.

Ensure the following dependencies are available on your Windows 11 machine (a quick way to verify them is shown after the list):

  • Python 3.11 or higher
  • git
  • curl (optional)
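
You can confirm that these tools are on your PATH by opening a command prompt and checking their versions (the exact version numbers on your machine may differ):

python --version
git --version
curl --version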

Model Conversion

We need to convert the LLM to the OpenVINO format. Create a folder and a Python virtual environment: open a command prompt and run the commands below:

mkdir model_converter
cd model_converter
python -m venv env
env\Scripts\activate

Next, install the dependencies by running the command below:

pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt

Then download the export_model.py file to the “model_converter” folder, or use the command below:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py  --ssl-no-revoke -o export_model.py

Create a folder named “models” by running the command below:

mkdir models

Next, we need to convert the TinyLlama-1.1B-Chat-v1.0 model to OpenVINO using the command below:

python export_model.py text_generation --source_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --config_file_path models/config.json --model_repository_path models 

(Screenshot: OpenVINO model conversion for OVMS)

After conversion, the OpenVINO TinyLlama model is available in the “models” directory, with its weights quantized to int4. We will need the full path to “config.json” to serve the model with the OpenVINO Model Server.
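
The generated config.json is what OVMS reads to know which models to load. Its exact contents depend on the export_model.py version, but for this model it typically looks roughly like the sketch below (illustrative, not the authoritative file):

{
  "mediapipe_config_list": [
    {
      "name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
      "base_path": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    }
  ],
  "model_config_list": []
}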

Deploy model with OpenVINO Model Server

OpenVINO Model Server (OVMS) is a high-performance system for serving optimized machine learning models over standard network protocols. Built in C++ and optimized for Intel hardware, it uses the OpenVINO toolkit for accelerated inference and exposes gRPC and REST APIs compatible with TensorFlow Serving and KServe, which simplifies integration. It supports diverse frameworks and model formats, enabling scalable, efficient deployment for AI applications such as image recognition and NLP. OVMS is open source, flexible, and designed for production environments.

Download the OpenVINO Model Server for Windows from the GitHub releases page here. We are using OpenVINO™ Model Server 2025.0 for this tutorial; extract the ZIP file (ovms_windows.zip).

Open a command prompt, change into the “ovms_windows” folder, and run the following command:

setupvars.bat
(Screenshot: OVMS setup)

Next, we can start OVMS and serve our model by running the command below:

ovms --config_path "<<path to your converted model's config.json>>" --rest_port 8000
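
For example, if the “model_converter” folder was created directly on the C: drive, the command would look like this (the path is illustrative; substitute your own):

ovms --config_path "C:\model_converter\models\config.json" --rest_port 8000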

OVMS is now running at http://localhost:8000. Verify that the model is being served by requesting the API endpoint http://localhost:8000/v1/config.
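
For example, with curl (the exact JSON returned may differ slightly between OVMS versions, but the model should be listed with an AVAILABLE state):

curl http://localhost:8000/v1/config

{
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {"error_code": "OK", "error_message": "OK"}
      }
    ]
  }
}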

(Screenshot: OVMS API configuration)

The model is ready to serve requests when its state is reported as “AVAILABLE” in the API response above.

Use the LLM

The OpenVINO Model Server provides a chat completions API endpoint compatible with the OpenAI API. The chat completions endpoint of OVMS is:

http://localhost:8000/v3/chat/completions
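
You can exercise the endpoint directly with curl before writing any code. A minimal request from the Windows command prompt might look like this (the prompt and max_tokens value are just examples):

curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\", \"messages\": [{\"role\": \"user\", \"content\": \"What is OpenVINO?\"}], \"max_tokens\": 100}"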

Install the OpenAI Python package:

pip install openai

We can test the LLM using the code below:

from openai import OpenAI

# Point the OpenAI client at the local OVMS endpoint;
# the local server does not require a real API key, so any placeholder works
client = OpenAI(
  base_url="http://localhost:8000/v3",
  api_key="unused"
)

# Request a streamed chat completion from the served TinyLlama model
stream = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "What is opencv?"}],
    stream=True,
    temperature=0.1,
)

# Print tokens as they arrive
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Output
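
The streamed answer prints token by token as the model generates it. If you prefer a single, complete response instead of streaming, the same endpoint can be called with the same client by dropping stream=True (a minimal sketch):

# Request a complete (non-streamed) chat completion from the served model
response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "What is opencv?"}],
    temperature=0.1,
)
print(response.choices[0].message.content)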

Conclusion

In short, deploying an LLM on Windows is straightforward with the OpenVINO Model Server. Just remember to convert your chosen model with the appropriate quantization for optimal performance.