Integration: vLLM Invocation Layer

Use the vLLM inference engine with Haystack

Authors
Lukas Kreussel


Use vLLM in your Haystack pipeline to utilize fast, self-hosted LLMs.


Table of Contents

  • Overview
  • Haystack 2.x
    • Installation
    • Usage
  • Haystack 1.x
    • Installation (1.x)
    • Usage (1.x)

Overview

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It is an open-source project that allows serving open models in production, when you have GPU resources available.

For Haystack 1.x, the integration is available as a separate package; for Haystack 2.x, it works out of the box.

Haystack 2.x

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used with the OpenAIGenerator and OpenAIChatGenerator components in Haystack.

For an end-to-end example of vLLM + Haystack 2.x, see this notebook.

Installation

vLLM must be installed first.

  • you can use pip: pip install vllm (see the vLLM documentation for more information)
  • for production use cases, there are many other options, including Docker (docs)

Usage

You first need to run a vLLM OpenAI-compatible server. You can do that using Python or Docker.
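For example, assuming the vllm package is installed and a supported GPU is available, the server can be started with vLLM's OpenAI-compatible entrypoint (the model name is illustrative):

```shell
# Start a vLLM server that speaks the OpenAI API at http://localhost:8000/v1
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.1 \
    --port 8000
```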

Then, you can use the OpenAIGenerator and OpenAIChatGenerator components in Haystack to query the vLLM server.

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    # for compatibility with the OpenAI API, a placeholder api_key is needed
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
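Because vLLM implements the OpenAI API protocol, the generator is essentially posting a standard chat-completions request body to api_base_url. A minimal sketch of that request shape (field names follow the OpenAI chat completions API; the model name and message are illustrative):

```python
import json

# The JSON body an OpenAI-compatible client POSTs to <api_base_url>/chat/completions
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [
        {"role": "user", "content": "Hi. Can you help me plan my next trip to Italy?"}
    ],
    "max_tokens": 512,  # passed through from generation_kwargs
}
body = json.dumps(payload)
print(body)
```

This is why no vLLM-specific Haystack component is needed in 2.x: any client that speaks this protocol works against the server.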

Haystack 1.x

Installation (1.x)

Install the wrapper via pip: pip install vllm-haystack

Usage (1.x)

This integration provides two invocation layers:

  • vLLMInvocationLayer: To use models hosted on a vLLM server
  • vLLMLocalInvocationLayer: To use locally hosted vLLM models

Use a Model Hosted on a vLLM Server

To query a model hosted on a vLLM server, use the vLLMInvocationLayer.

Here is a simple example of how a PromptNode can be created with the wrapper.

from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMInvocationLayer


model = PromptModel(
    model_name_or_path="",
    invocation_layer_class=vLLMInvocationLayer,
    max_length=256,
    api_key="EMPTY",
    model_kwargs={
        "api_base": API,  # replace this with your API URL
        "maximum_context_length": 2048,
    },
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)

The model name is inferred automatically from the model served on the vLLM server. For more configuration examples, take a look at the unit tests.

Hosting a vLLM Server

To create an OpenAI-Compatible Server via vLLM you can follow the steps in the Quickstart section of their documentation.
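For instance, vLLM publishes an official Docker image (vllm/vllm-openai); assuming Docker with GPU support is available, the server could be started roughly like this (model name illustrative):

```shell
# Run the OpenAI-compatible vLLM server in Docker, exposing port 8000
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai \
    --model mistralai/Mistral-7B-Instruct-v0.1
```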

Use a Model Hosted Locally

⚠️ To run vLLM locally, you need to have vllm installed and a supported GPU.

If you don’t want to use an API server, this wrapper also provides a vLLMLocalInvocationLayer, which runs vLLM on the same machine Haystack is running on.

Here is a simple example of how a PromptNode can be created with the vLLMLocalInvocationLayer.

from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMLocalInvocationLayer

model = PromptModel(
    model_name_or_path=MODEL,
    invocation_layer_class=vLLMLocalInvocationLayer,
    max_length=256,
    model_kwargs={
        "maximum_context_length": 2048,
    },
)

prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)

