This guide covers how to start different inference engines with various configurations using Inference Engine Arena.
Basic Usage
The basic syntax for starting an engine is:
# Start an engine
arena start <engine_type> <model_name_or_path> [engine_args]
Where:
- <engine_type> is the type of engine (e.g., vllm, sglang)
- <model_name_or_path> is either a Hugging Face model ID or a local path to a model (local paths are not currently supported)
- [engine_args] are arguments passed directly to the underlying engine; for vLLM, this is compatible with anything you would pass after vllm serve
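Because engine arguments are forwarded verbatim, any flag the engine's own CLI accepts can be appended. As a sketch (the --tensor-parallel-size value is illustrative and assumes two GPUs are available):
# Flags after the model name are passed straight through to vllm serve
arena start vllm NousResearch/Meta-Llama-3.1-8B --tensor-parallel-size 2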
Environment Variables (optional)
Before starting an engine, you can set environment variables to configure advanced behaviors. These are set using standard shell commands:
# Example setting environment variables before starting an engine
export VLLM_USE_V1=1
export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGING_FACE_TOKEN"
export CUDA_VISIBLE_DEVICES=1
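In this example, VLLM_USE_V1=1 opts into vLLM's V1 engine, HUGGING_FACE_HUB_TOKEN authenticates downloads of gated models from the Hugging Face Hub, and CUDA_VISIBLE_DEVICES=1 restricts the engine to the GPU with index 1.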
Starting vLLM
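# Start vLLM with prefix caching enabled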
arena start vllm NousResearch/Meta-Llama-3.1-8B --enable-prefix-caching
# Start vLLM with environment variables and engine arguments
export VLLM_USE_V1=1
export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGING_FACE_TOKEN"
export CUDA_VISIBLE_DEVICES=1
arena start vllm NousResearch/Meta-Llama-3.1-8B --enable-prefix-caching --quantization fp8
Starting SGLang
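# Start SGLang with a custom chunked-prefill size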
arena start sglang NousResearch/Meta-Llama-3.1-8B --chunked-prefill-size 2048
# Start SGLang with environment variables and engine arguments
export SGL_ENABLE_JIT_DEEPGEMM=1
export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGING_FACE_TOKEN"
export CUDA_VISIBLE_DEVICES=1
arena start sglang NousResearch/Meta-Llama-3.1-8B --enable-torch-compile
Managing Running Engines
To see the status of all running engines:
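# Show the status of all running engines
# (the exact subcommand is assumed here; consult arena --help)
arena status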
This will show the engine type, container ID, model, status, and endpoint for each engine.
To view the logs of a running engine:
# Show recent logs
arena logs vllm
# Follow logs in real-time
arena logs vllm --follow
# View a specific number of lines
arena logs vllm --tail 500
To stop a running engine:
# Stop by engine type
arena stop vllm