Llama 3.3 on vLLM with Speculative Decoding



You can now double the tokens/s output speed with speculative decoding in vLLM. However, it is currently incompatible with JSON output mode.

๐Ÿ“ In Short

Use this:

python3 -m vllm.entrypoints.openai.api_server \
--model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic \
--speculative-model neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic \
--num-speculative-tokens 5

In my benchmarks on an H100, throughput went from ~30 to ~60 tokens/s.

However, response_format: json_object is currently incompatible: https://github.com/vllm-project/vllm/issues/9423

🆕 Speculative Decoding - New vLLM Feature

Speculative decoding has (more or less) arrived in vLLM: https://docs.vllm.ai/en/latest/usage/spec_decode.html

The feature can already be used, but they warn that it is a work in progress.

Basically, a smaller draft model generates a handful of candidate tokens, and the larger model then verifies all of them in a single pass, which is much cheaper than generating them one by one. Candidates are kept as long as they match what the larger model would have produced; at the first mismatch, the larger model's own token is used instead.
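
To make that concrete, here is a heavily simplified, toy Python sketch of the draft-and-verify idea. It is not vLLM's actual implementation (real speculative decoding verifies all proposals in one batched forward pass and uses a probabilistic acceptance rule when sampling); the two "models" below are just stand-in functions:

# Toy illustration of draft-and-verify speculative decoding.
# The "models" below are trivial stand-ins, not real LLMs.

def draft_model(context):   # small, fast, sometimes wrong
    return (sum(context) + len(context)) % 7

def target_model(context):  # large, slow, treated as ground truth
    return (sum(context) + len(context)) % 5

def speculative_step(context, k=5):
    # 1. The small model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        token = draft_model(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The large model checks each proposed position. In a real system all
    #    k checks happen in ONE forward pass, which is the whole point:
    #    verifying k tokens is much cheaper than generating k tokens.
    emitted, ctx = [], list(context)
    for proposed in draft:
        expected = target_model(ctx)
        if proposed == expected:
            emitted.append(proposed)   # draft token accepted
            ctx.append(proposed)
        else:
            emitted.append(expected)   # first mismatch: emit the big model's token and stop
            break
    return emitted

print(speculative_step([1, 2, 3]))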

Let's take it for a spin!

💻 Testing Setup on Vast.AI

😢 I don't own an H100. Neither do you (probably).

๐Ÿค LUCKILY: we can both rent them on https://vast.ai/ for experimentation.

Here's the template data that I used in my experiments:

Notably, I left the On-start Script empty. This is really convenient when experimenting: instead, you just SSH into the machine and run the commands manually. You can then use CTRL+C to stop the vLLM server and easily change its arguments. The machine IP and port stay the same, the models remain cached on disk, etc.

Here's a screenshot of what that template looks like:

Next, find yourself a machine with a single H100 card and 96 GB of VRAM:

Click that blue > Connect button to obtain the SSH command:

You can now use a command like this to start vLLM:

python3 -m vllm.entrypoints.openai.api_server \
--api-key abc6356fce95ebb702f7 \
--max-model-len 8192 \
--model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic

Explanation of the arguments:

  --api-key: an arbitrary secret that clients must send as a Bearer token. Pick your own.
  --max-model-len 8192: caps the context length so the KV cache fits comfortably in VRAM.
  --model: an FP8-quantized Llama 3.3 70B Instruct, small enough to fit on a single H100.

It will take a while to download the model and get started. You'll know it is fully operational when you see:

INFO:     Started server process [7294]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Like at the bottom of this screenshot:

You can then use a curl command like this to test it:

REQUEST_JSON="{
    \"model\": \"nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic\",
    \"temperature\": 0.5,
    \"max_tokens\": 3000,
    \"messages\": [
        {
            \"role\": \"system\",
            \"content\": \"You are a helpful assistant.\"
        },
        {
            \"role\": \"user\",
            \"content\": \"Tell me a joke.\"
        }
    ]
}"

curl --silent --url "http://38.99.105.118:20012/v1/chat/completions" \
    --header "Authorization: Bearer abc6356fce95ebb702f7" \
    --header "Content-Type: application/json" \
    --data-raw "${REQUEST_JSON}" \
    | jq

⚠️ WARNING: Plain, unencrypted HTTP. You'll need something better for your production workloads. This is just for testing!
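
If you would rather poke at the server from Python than curl, here is a minimal sketch using the openai client library pointed at the vLLM endpoint (the IP, port and API key are the same placeholder values as in the curl example):

from openai import OpenAI

# Same placeholder IP, port and API key as in the curl example above.
client = OpenAI(
    base_url="http://38.99.105.118:20012/v1",
    api_key="abc6356fce95ebb702f7",
)

response = client.chat.completions.create(
    model="nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
    temperature=0.5,
    max_tokens=3000,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)

print(response.choices[0].message.content)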

🤡 You'll get a response like this:

🧪 Adding SpecDec

๐ŸŽ๏ธ Time to add speculative decoding!

The new command looks like this:

python3 -m vllm.entrypoints.openai.api_server \
--api-key abc6356fce95ebb702f7 \
--max-model-len 8192 \
--model nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic \
--speculative-model neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic \
--num-speculative-tokens 5

So, we just added 2 args at the end:

  --speculative-model: the smaller draft model (here an FP8 Llama 3.2 3B Instruct) that proposes tokens.
  --num-speculative-tokens 5: how many tokens the draft model proposes per verification step.

Once it is up and running again, you can re-run the earlier curl command to test.

Keep an eye out for the Speculative metrics:

Speculative metrics: Draft acceptance rate: 0.800, System efficiency: 0.333, Number of speculative tokens: 5, Number of accepted tokens: 4, Number of draft tokens: 5, Number of emitted tokens: 2.

Neat! This means 80% of the tokens proposed by the smaller model were accepted by the larger one. In practice, that translated into roughly a 2x performance boost in my benchmarks.
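
For the curious, here is how I read those numbers. This is my interpretation of the vLLM metrics, so treat the exact definitions as assumptions rather than gospel:

num_speculative_tokens = 5   # our --num-speculative-tokens setting
draft_tokens = 5             # proposals made by the small model in this log window
accepted_tokens = 4          # proposals the large model agreed with
emitted_tokens = 2           # tokens actually emitted to the output so far

# Draft acceptance rate: how often the small model guesses "correctly".
print(accepted_tokens / draft_tokens)                 # 0.8

# System efficiency: emitted tokens vs. the theoretical best case per step
# (all 5 drafts accepted plus 1 bonus token from the verification pass).
print(emitted_tokens / (num_speculative_tokens + 1))  # ~0.333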

🧪 Benchmarking and Other Speculation Configurations

I benchmarked with ~20 custom prompts in Swedish. The use case in those prompts was largely summarization (a good use case for SpecDec).

The settings above were the best ones I managed to find for Llama 3.3 SpecDec.
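
If you want to run a similar comparison yourself, a rough sketch of a tokens/s measurement against the OpenAI-compatible endpoint could look like this (not my exact script; the prompts, endpoint and key are placeholders):

import time
from openai import OpenAI

# Placeholder endpoint and key - reuse the values from your own instance.
client = OpenAI(base_url="http://38.99.105.118:20012/v1",
                api_key="abc6356fce95ebb702f7")

prompts = ["Summarize this text: ...", "Summarize this text: ..."]  # your ~20 test prompts

total_tokens = 0
total_seconds = 0.0
for prompt in prompts:
    start = time.time()
    response = client.chat.completions.create(
        model="nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
        temperature=0.5,
        max_tokens=3000,
        messages=[{"role": "user", "content": prompt}],
    )
    total_seconds += time.time() - start
    total_tokens += response.usage.completion_tokens

print(f"Average output speed: {total_tokens / total_seconds:.1f} tokens/s")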

I also tried these other speculation configurations:

During my benchmarking, I also found that response_format: json_object makes vLLM crash when SpecDec is enabled. I suspect this is the related issue: https://github.com/vllm-project/vllm/issues/9423
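
To be concrete, this is the kind of request that triggered the crash for me: the same chat completion as before, but with response_format added (a sketch with the same placeholder endpoint and key as earlier):

from openai import OpenAI

# Same placeholder endpoint and key as in the earlier examples.
client = OpenAI(base_url="http://38.99.105.118:20012/v1",
                api_key="abc6356fce95ebb702f7")

# Asking for JSON output via response_format - with SpecDec enabled,
# this kind of request crashed the vLLM server for me.
response = client.chat.completions.create(
    model="nm-testing/Llama-3.3-70B-Instruct-FP8-dynamic",
    temperature=0.5,
    max_tokens=3000,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Answer in JSON."},
        {"role": "user", "content": "Tell me a joke about GPUs."},
    ],
)
print(response.choices[0].message.content)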

💡 Conclusions

💡 The performance increase from 30 to 60 tokens/s for Llama 3.3 70B on an H100 with vLLM SpecDec (a recent, still-experimental feature) is going to be really meaningful, for more than one reason:

๐ŸŒ 30 tokens/s felt a bit slow and detrimental to UX in my opinion. With 60 tokens/s it actually feels usable. It is like this speed bump made it past an important boundary. Are you building something where the user has to wait? SpecDec in vLLM is probably going to be an enabler for you.

🤡 Elon Musk has reportedly bought up NVIDIA's Blackwell production for the next few months, which means most of us are stuck on H100 cards for a while longer. Fantastic that a software change can double the tokens/s output!

โณ Once SpecDec in vLLM stabilizes (and response_format: json_object works again) the open source ecosystem will seem somewhat complete for those that want to abandon OpenAI GPT-4 and the US-owned cloud. The timeline looks like this:

  1. Llama 3.1 70B (Summer 2024): The first multilingual open model viable for "weird" languages such as Swedish. Not quite GPT-4o level, though.
  2. Llama 3.3 70B (Winter 2024): Just as good as GPT-4o. Performance on par with Llama 3.1 405B but now fits on a single H100.
  3. vLLM SpecDec (Spring 2025): Performance becomes "good enough" on H100.

🌟 Looking forward to the open LLM ecosystem maturing even further next year. Will further software optimizations + Llama 4 make the RTX 5090 32GB the "new H100 for inference" in 2025? Quite likely. Happy new year!