Llama 3.3 on vLLM with Speculative Decoding
You can now double the tokens/s output speed with speculative decoding in vLLM. However, it is currently incompatible with JSON output mode.
In Short
Use this:
python3 -m vllm.entrypoints.openai.api_server \
--model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic \
--speculative-model neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic \
--num-speculative-tokens 5
The tokens/s increased from ~30 to ~60 on an H100 when I benchmarked.
However, response_format: json_object is currently incompatible: https://github.com/vllm-project/vllm/issues/9423
Speculative Decoding - New vLLM Feature
Speculative decoding has (sort of) arrived in vLLM: https://docs.vllm.ai/en/latest/usage/spec_decode.html
The feature can already be used, but they warn that it is a work in progress.
Basically, a smaller draft model generates candidate tokens. The larger model then validates those tokens in a single pass, which is faster than generating them one by one, and only generates tokens itself when the smaller model fails to produce the correct ones.
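To make the idea concrete, here is a toy sketch of the draft-and-verify loop. This is my own illustration in plain Python, not vLLM's implementation; the "models" are stand-in functions.

# Toy sketch of speculative decoding (illustration only, not vLLM's code).
# In reality the large model verifies all drafted tokens in one batched
# forward pass, which is what makes verification cheaper than generation.

def draft_model(context, k):
    # Cheap stand-in for the small model: guess the next k tokens.
    return [(context[-1] + 1 + i) % 100 for i in range(k)]

def target_model(context):
    # Expensive stand-in for the large model: compute the single next token.
    return (context[-1] + 1) % 100

def speculative_step(context, k=5):
    draft = draft_model(context, k)          # 1. small model drafts k tokens
    accepted = []
    for token in draft:                      # 2. large model verifies them
        expected = target_model(context + accepted)
        if token == expected:
            accepted.append(token)           #    matching drafts are kept "for free"
        else:
            accepted.append(expected)        # 3. on mismatch, the large model's token wins
            break
    return accepted                          # one step can emit up to k tokens here

print(speculative_step([1, 2, 3]))           # -> [4, 5, 6, 7, 8]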
Let's take it for a spin!
Testing Setup on Vast.AI
I don't own an H100. Neither do you (probably).
LUCKILY: we can both rent them on https://vast.ai/ for experimentation.
Here's the template data that I used in my experiments:
- Image Path:Tag: vllm/vllm-openai:latest
- Docker Options: --runtime nvidia --gpus all -p 8000:8000 --ipc=host
- Select Launch Mode: Interactive shell server, SSH
- Extra Filters: cpu_arch in ['amd64']
- Disk Space: 200GB
Notably, I left the On-start Script empty. This is really convenient when experimenting: just SSH into the machine and run the commands manually. You can then use CTRL+C to stop the vLLM server and change arguments easily, while the machine IP and PORT stay the same and the models remain cached on disk.
Here's a screenshot of what that template looks like:
Next, find yourself a machine with a single H100 card and 96 GB of VRAM:
Click the blue > Connect button to obtain the SSH command:
You can now use a command like this to start vLLM:
python3 -m vllm.entrypoints.openai.api_server \
--api-key abc6356fce95ebb702f7 \
--max-model-len 8192 \
--model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic
Explanation:
- --api-key abc6356fce95ebb702f7: Adds basic security to the OpenAI-compatible API.
- --max-model-len 8192: Enforces a smaller maximum context length so that it fits in H100 VRAM.
- --model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic: An experimental NeuralMagic FP8 model for Llama 3.3.
It will take a while to download the model and get started. You'll know it is fully operational when you see:
INFO: Started server process [7294]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Like at the bottom of this screenshot:
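If you'd rather not watch the logs, a small polling script can also tell you when the server is ready. A minimal sketch, assuming the OpenAI-compatible /v1/models endpoint and the same IP:PORT and API key used in the curl example below:

# Poll the OpenAI-compatible /v1/models endpoint until the server answers.
# (Sketch only; swap in your own instance IP:PORT and API key.)
import time
import requests

BASE_URL = "http://38.99.105.118:20012"
HEADERS = {"Authorization": "Bearer abc6356fce95ebb702f7"}

while True:
    try:
        r = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=5)
        if r.status_code == 200:
            print("vLLM is up, serving:", [m["id"] for m in r.json()["data"]])
            break
    except requests.RequestException:
        pass  # server not accepting connections yet
    time.sleep(10)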
You can then use a curl command like this to test it:
REQUEST_JSON="{
\"model\": \"nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic\",
\"temperature\": 0.5,
\"max_tokens\": 3000,
\"messages\": [
{
\"role\": \"system\",
\"content\": \"You are a helpful assistant.\"
},
{
\"role\": \"user\",
\"content\": \"Tell me a joke.\"
}
]
}"
curl --silent --url "http://38.99.105.118:20012/v1/chat/completions" \
--header "Authorization: Bearer abc6356fce95ebb702f7" \
--header "Content-Type: application/json" \
--data-raw "${REQUEST_JSON}" \
| jq
WARNING: Plain unencrypted HTTP. You'll need something better for your production workloads. This is just for testing!
You'll get a response like this:
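The joke and the exact token counts will of course differ, but the OpenAI-compatible response has roughly this shape (values below are illustrative):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Why don't scientists trust atoms? Because they make up everything!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 17,
    "total_tokens": 40
  }
}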
Adding SpecDec
Time to add speculative decoding!
- Main Model: https://huggingface.co/nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic (beta release from NeuralMagic of Llama 3.3 70B FP8)
- Small Model: https://huggingface.co/neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic (stable release from NeuralMagic of Llama 3.2 3B FP8)
The new command looks like this:
python3 -m vllm.entrypoints.openai.api_server \
--api-key abc6356fce95ebb702f7 \
--max-model-len 8192 \
--model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic \
--speculative-model neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic \
--num-speculative-tokens 5
So, we just added 2 args at the end:
- --speculative-model neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic: The smaller model to use for speculation.
- --num-speculative-tokens 5: Speculate 5 tokens at a time.
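For reference, the same setup through vLLM's offline Python API looks roughly like this. This is a sketch modeled on the example in the spec_decode docs; I only ran the OpenAI server myself, and the argument names may shift while the feature is still a work in progress.

# Same speculative-decoding config via the offline Python API (sketch).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
    max_model_len=8192,
    speculative_model="neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic",
    num_speculative_tokens=5,
)
outputs = llm.generate(
    ["Tell me a joke."],
    SamplingParams(temperature=0.5, max_tokens=200),
)
print(outputs[0].outputs[0].text)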
Once the server is up and running again, you can use the same curl command to test it.
Keep an eye out for the Speculative metrics line in the log output:
Speculative metrics: Draft acceptance rate: 0.800, System efficiency: 0.333, Number of speculative tokens: 5, Number of accepted tokens: 4, Number of draft tokens: 5, Number of emitted tokens: 2.
Neat! This means 80% of the tokens produced by the smaller model were accepted by the larger model. In practice that translated into roughly a 2x performance boost in my benchmarks.
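My reading of how those numbers fit together (the formulas are my interpretation of the metric names, so treat this as a sketch):

# Back-of-the-envelope check of the Speculative metrics line above.
num_speculative_tokens = 5   # --num-speculative-tokens
draft_tokens = 5             # "Number of draft tokens"
accepted_tokens = 4          # "Number of accepted tokens"
emitted_tokens = 2           # "Number of emitted tokens"

draft_acceptance_rate = accepted_tokens / draft_tokens             # 4 / 5 = 0.800
system_efficiency = emitted_tokens / (num_speculative_tokens + 1)  # 2 / 6 = 0.333
print(draft_acceptance_rate, round(system_efficiency, 3))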
Benchmarking and Other Speculation Configurations
I benchmarked with ~20 custom prompts in Swedish. The use case in those prompts was largely summarization (a good use case for SpecDec).
The settings above were the best ones I managed to find for Llama 3.3 SpecDec.
I also tried these other speculation configurations:
- --speculative-model shuyuej/Llama-3.2-3B-Instruct-GPTQ: Draft acceptance rate dropped from 80% to 74% with this 4-bit quant. The smaller model size did not make up for that, so performance was slightly worse in the end.
- --speculative-model ibm-fms/llama3-70b-accelerator: Draft acceptance rate dropped from 80% to 1%. This example from the docs did not work well at all.
- --speculative-model "[ngram]" --ngram-prompt-lookup-max 4: Draft acceptance rate dropped from 80% to 46%. Faster than without speculation, but slower than using neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic.
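For reference, the core of such a measurement can be as simple as timing a non-streaming request and dividing by the reported completion tokens. A rough sketch (not my exact harness; the prompt is just an illustrative stand-in for my Swedish summarization prompts):

# Crude tokens/s measurement against the OpenAI-compatible endpoint (sketch).
import time
import requests

URL = "http://38.99.105.118:20012/v1/chat/completions"
HEADERS = {"Authorization": "Bearer abc6356fce95ebb702f7"}

payload = {
    "model": "nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
    "temperature": 0.5,
    "max_tokens": 1000,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the history of Sweden in 500 words."},
    ],
}

start = time.time()
response = requests.post(URL, headers=HEADERS, json=payload).json()
elapsed = time.time() - start
completion_tokens = response["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/s")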
During my benchmarking I also found that response_format: json_object makes vLLM crash when SpecDec is enabled. I suspect this is the related issue: https://github.com/vllm-project/vllm/issues/9423
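For clarity, this is the kind of request I mean: OpenAI-style JSON mode, added to the payload shown earlier (the prompt is just an example). With SpecDec enabled, requests like this crashed the server for me.

# OpenAI-style JSON mode: the response_format field in the request payload.
request_json = {
    "model": "nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "user", "content": "List three facts about Sweden as JSON."},
    ],
}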
Conclusions
The performance increase from 30 to 60 tokens/s for Llama 3.3 70B on an H100 with vLLM SpecDec (recent beta release) is going to be really meaningful for more than one reason:
30 tokens/s felt a bit slow and detrimental to UX in my opinion. With 60 tokens/s it actually feels usable; it is as if this speed bump crossed an important threshold. Are you building something where the user has to wait? SpecDec in vLLM is probably going to be an enabler for you.
Elon Musk has bought all NVIDIA Blackwell production for the upcoming few months. This means most of us are stuck on H100 cards for a while longer. Fantastic that a software change can double the tokens/s output!
Once SpecDec in vLLM stabilizes (and response_format: json_object works again), the open-source ecosystem will seem somewhat complete for those who want to abandon OpenAI GPT-4 and the US-owned cloud. The timeline looks like this:
- Llama 3.1 70B (Summer 2024): The first multilingual open model viable for weird languages such as Swedish was released. Not quite GPT-4o level though.
- Llama 3.3 70B (Winter 2024): Just as good as GPT-4o. Performance on par with Llama 3.1 405B but now fits on a single H100.
- vLLM SpecDec (Spring 2025): Performance becomes "good enough" on H100.
Looking forward to the open LLM ecosystem maturing even further next year. Will further software optimizations + Llama 4 make the RTX 5090 32GB the "new H100 for inference" in 2025? Quite likely. Happy new year!