Get started with Inference Engine Arena in minutes
This guide will help you get up and running with Inference Engine Arena quickly. You’ll learn how to install the framework, start an inference engine, and run a simple benchmark.
If you encounter any issues, please refer to the Troubleshooting guide or submit an issue on GitHub.
After installing the framework, you can start inference engines and run benchmarks in two main ways:
Manual mode: Start engines and run benchmarks separately. This is designed for simple test cases and experimental runs (see the example sketch below this list).
Batch mode: Drive the whole benchmark process from a single command or a single YAML file. This suits complex, large-scale test cases and sharing reproducible experiments with others.
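In manual mode, the first step is to start an engine. The exact subcommand and flags depend on your setup, so treat the invocation below as a sketch only; the arena start form and the --model flag are assumptions inferred from the other commands in this guide, and arena --help will show the real options.

```bash
# Hypothetical sketch: start a vLLM engine serving the Llama 3.1 8B model used later in this guide.
# The subcommand and flags are assumptions; run arena --help for the actual syntax.
arena start vllm --model NousResearch/Meta-Llama-3.1-8B
```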
You’ll see logs as the engine starts up. Wait until you see:
[INFO] src.engines.engine_manager: Started engine vllm
[INFO] src.cli.commands: Engine started successfully: vllm
The process keeps running in the background, and you can check the status of running engines:
arena list
The output shows the engine that is currently running:
Inference Engines (1):
Name    Container ID    Model                        Status    Endpoint                GPU Info
--------------------------------------------------------------------------------------------------------------
vllm    34547e9b5aa8    NousResearch/Meta-Llama-3    running   http://localhost:8000   NVIDIA H100 80GB HBM3
Once your engine is running, you can run a simple benchmark:
# Run conversational and summarization benchmarks on vLLM
arena run --engine vllm --benchmark conversational_short conversational_medium summarization
Tips: You can run multiple benchmarks on the same engine, or run benchmarks on several engines at the same time. Refer to Run Benchmarks for more details.
The output shows metrics such as throughput and TTFT (time to first token). Results are saved to ./results by default. Refer to the Dashboard and Leaderboard for more details.
Running benchmark: conversational_short on vllm
······
src.benchmarks.benchmark_runner: Benchmark completed successfully. Key metrics: input_throughput=975.0938019114691, output_throughput=899.8165604039037, ttft=19.5751617802307
······
Benchmark Run Summary:
  Run ID: run-20250420-231844-0e03bab4
  Start time: 2025-04-20T23:18:44.583338
  End time: 2025-04-20T23:23:17.441123
  Duration: 272.9 seconds
  Engines: vllm
  Benchmark types: conversational_short, conversational_medium, conversational_long, rewrite_essay, summarization, write_essay
  Sub-runs: 6
Results saved to: ./results/run-20250420-231844-0e03bab4
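Each run gets its own directory under ./results, as shown on the last line of the output above, and individual sub-run results are saved as JSON files (the sub-run file naming can be seen in the upload example later in this guide). A quick way to see what a run produced, assuming a standard shell:

```bash
# List the files produced by the run above; the exact layout inside the
# run directory may differ, so inspect it before post-processing results.
ls ./results/run-20250420-231844-0e03bab4/
```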
If you don’t need this engine anymore, or want to restart it with different parameters, stop it:
arena stop vllm
The command prints a confirmation once the engine has been stopped successfully.
For more complex scenarios, where you need to benchmark multiple engines with different engine and benchmark configurations, you can define your experiments in a YAML file and run them with a single command.
Here we use /example_yaml/Meta-Llama-3.1-8B-varied-max-num-seq.yaml as an example; it runs the same benchmark type against several max-num-seqs settings.
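That file isn't reproduced here, but conceptually it pairs one benchmark type with several engine configurations that differ only in their max-num-seqs value. The snippet below is only a hypothetical sketch of that idea; the field names are assumptions, so open the file in /example_yaml for the actual schema.

```yaml
# Hypothetical sketch only -- field names here are assumptions, not the real schema.
# See /example_yaml/Meta-Llama-3.1-8B-varied-max-num-seq.yaml for the actual format.
engines:
  - name: vllm
    model: NousResearch/Meta-Llama-3.1-8B
    args:
      max-num-seqs: 64
  - name: vllm
    model: NousResearch/Meta-Llama-3.1-8B
    args:
      max-num-seqs: 256
benchmarks:
  - conversational_short
```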
Tips: You may also refer to the other examples in the /example_yaml directory, and to the runyaml section for more details.
If you want to test more engines or benchmark types, you can keep adding engines with different configurations and benchmark configurations to the YAML file.
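Once the YAML file describes your experiment, a single command launches the whole run. The invocation below is a sketch based on the runyaml section referenced above; check that section for the exact syntax.

```bash
# Sketch: launch the whole experiment described in the YAML file with one command.
# The runyaml subcommand name is inferred from the section referenced above.
arena runyaml example_yaml/Meta-Llama-3.1-8B-varied-max-num-seq.yaml
```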
To share your benchmark results with the community, use these commands:
# Upload a single result to the global leaderboard
arena upload sub-run-20250420-211509-vllm-Meta-Llama-3-1-8B-conversational-short-582ae937.json

# Upload all results to the global leaderboard
arena upload

# Anonymous data upload
arena upload --no-login
If you don’t use the --no-login flag, you’ll need to log in to authorize the upload. We recommend starting with a single JSON file upload to complete the login process, then running arena upload to upload all your data. Alternatively, you can first share your results in the dashboard using the “Share Subrun to Global Leaderboard” button. Don’t worry about duplicate submissions - our system automatically deduplicates any repeated data.