vLLM

Deploy vLLM to easily run inference on self-hosted AI models


Overview

vLLM is a Python library and server for LLM inference. It can download models from Hugging Face and run them on local GPUs. vLLM supports batched offline inference on datasets, provides an OpenAI-compatible API server for client requests, and can be tuned to the hardware available and the performance characteristics you require.
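As a rough illustration of the OpenAI-compatible API, the sketch below builds a request for the `/v1/completions` endpoint using only the Python standard library. The base URL and the model identifier `google/gemma-2b` are assumptions made for the example, not values taken from this page; substitute your own deployment's URL and the model you configured.

```python
import json
import urllib.request

# Hypothetical values for illustration; replace with your deployment URL and model.
BASE_URL = "https://vllm-example.koyeb.app"
MODEL = "google/gemma-2b"


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request performs no network I/O.
request = build_completion_request(BASE_URL, MODEL, "Koyeb is")
# Against a live server, the generated text would be read with:
# send(request)["choices"][0]["text"]
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a running Instance.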

This Starter deploys vLLM to Koyeb in one click. By default, it deploys on an Nvidia RTX 4000 SFF Ada GPU Instance using Google's Gemma 2B model. You can change the model during deployment by modifying the Command args in the Deployment section.

Requirements

  • Access to Koyeb GPU instances. Join the preview today!
  • A Hugging Face account. Accept the usage license for the Gemma 2B model, or for whichever alternative model you plan to use, and generate a read-only token so the deployment can download it. Some models also require you to request access.

Configuration

You must run vLLM on a GPU Instance type.

During initialization, vLLM downloads the specified model from Hugging Face. In the Health checks section of the configuration page, set the Grace period to 300 seconds to allow time for large models to download.
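Because the model download can take several minutes, a client may also want to wait for the server before sending requests. The following is a minimal sketch of such a polling loop, not part of this Starter. It assumes the vLLM server answers a GET on `/health` with HTTP 200 once it is ready; the helper names `http_probe` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5):
    """Return True if a GET to `url` answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, grace_period=300, interval=5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` until it returns True or `grace_period` seconds elapse."""
    deadline = clock() + grace_period
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A call might look like `wait_until_ready(lambda: http_probe(base_url + "/health"))`, where `base_url` is your deployment's public URL. The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays.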

To change the deployed model, in the Deployment section, modify the selected model in the Command args field.

When deploying vLLM on Koyeb, the following environment variables configure the deployment. Make sure every required variable is set to an appropriate value:

  • HF_TOKEN: An API token to authenticate to Hugging Face. A read-only token is sufficient; it is used to verify that you have accepted the model's usage license.
  • VLLM_API_KEY (Optional): An API key you can set to limit access to the server. When an API key is set, every request must provide it as a bearer token in the Authorization header.
  • VLLM_DO_NOT_TRACK (Optional): Set to "1" to disable sending usage statistics to the vLLM project.
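The variables above can be sketched in code as follows. This is an illustrative helper, not part of vLLM or Koyeb: the function names `missing_required` and `auth_headers` are hypothetical, and the example values are placeholders. It checks that the one required variable is present and builds the bearer-token header a client would send when `VLLM_API_KEY` is set.

```python
def missing_required(env):
    """Return the names of required variables that are unset or empty."""
    required = ["HF_TOKEN"]
    return [name for name in required if not env.get(name)]


def auth_headers(env):
    """Return the Authorization header if VLLM_API_KEY is set, else no headers."""
    api_key = env.get("VLLM_API_KEY")
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}


# Placeholder environment for illustration only.
example_env = {
    "HF_TOKEN": "hf_example_token",
    "VLLM_API_KEY": "example-secret",
    "VLLM_DO_NOT_TRACK": "1",
}
print(missing_required(example_env))
print(auth_headers(example_env))
```

To check a real environment, pass `os.environ` instead of `example_env`.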


Related One-Click Apps in this category

  • DeepSparse Server

    DeepSparse is an inference runtime that takes advantage of sparsity in neural networks to offer GPU-class performance on CPUs.

  • Fooocus

    Deploy Fooocus, a powerful AI image generation tool, on Koyeb.

  • LangServe

    LangServe makes it easy to deploy LangChain applications as RESTful APIs.
