MMSpec

Benchmarking Speculative Decoding for Vision-Language Models

1University of Michigan, 2The Ohio State University, 3The University of Hong Kong,
4Indiana University, 5Peking University, 6Carnegie Mellon University, 7LMSYS Org
*Equal contribution.

MMSpec is the first benchmark dedicated to speculative decoding for vision-language models. It contains 600 multimodal samples across six task categories, integrates 10 representative speculative decoding algorithms under a unified framework, and introduces ViSkip, a plug-and-play method that adaptively disables drafting at vision-critical decoding steps.


Performance comparison of speculative decoding methods with Qwen2.5-VL-7B on MMSpec. Vision-aware methods are consistently stronger, and ViSkip builds directly on these benchmark findings.

Abstract

Vision-language models achieve strong performance on multimodal tasks but still suffer from high inference latency due to large model sizes and long multimodal contexts. MMSpec benchmarks speculative decoding in this setting with a unified evaluation protocol over six representative multimodal workloads. The benchmark reveals three core findings: methods built for text-only LLMs degrade in multimodal scenarios, vision awareness becomes increasingly important at larger batch sizes, and throughput speedup alone does not reliably reflect latency performance.

Motivated by these results, the project introduces ViSkip, a vision-aware speculative decoding strategy that estimates whether the current decoding step depends heavily on image tokens and dynamically switches between speculative decoding and standard autoregressive decoding.

Benchmark Design

The MMSpec Benchmark

MMSpec is built for fair third-party comparison of multimodal speculative decoding methods. All methods are evaluated on the same device setup, in the same software environment, with the same measurement protocol. The benchmark is designed around workload diversity, balanced topic coverage, explicit multi-turn support, and method-agnostic measurement.


Sampled data across the six MMSpec task categories.

| Task Category | Source (100 samples each) | Avg. Output Tokens |
| --- | --- | --- |
| General VQA | GQA | 46.98 |
| Text VQA | TextVQA | 63.15 |
| Image Captioning | COCO | 191.90 |
| Chart VQA | CharXiv | 68.56 |
| Complex Reasoning | MMMU-Pro | 285.60 |
| Multi-turn Conversation | ConvBench and MM-MT-Bench | 747.65 |

Algorithms Under Comparison

MMSpec unifies ten representative lossless speculative decoding methods, spanning training-based and training-free approaches. This makes the benchmark suitable for apples-to-apples comparison rather than isolated case studies.
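All ten methods share the same draft-then-verify skeleton that the unified framework standardizes. The sketch below is illustrative only (it is not MMSpec's implementation): the `target_model` and `draft_model` functions are hypothetical deterministic stand-ins for real models, and the loop shows the greedy variant, where the target accepts the longest agreeing prefix, corrects the first mismatch, and the output is provably identical to plain target decoding (losslessness).

```python
# Toy "models": each maps a token prefix to the next token (greedy decoding).
def target_model(prefix):
    # Hypothetical deterministic rule standing in for the full VLM.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix):
    # Cheaper drafter that agrees with the target most of the time.
    nxt = target_model(prefix)
    return nxt if len(prefix) % 7 else (nxt + 1) % 50  # occasional mismatch

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy draft-and-verify loop: the drafter proposes k tokens, the
    target accepts the longest agreeing prefix and fixes the first
    mismatch, so the output matches plain target decoding exactly."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2) Verify: the target re-derives each position; stop at first mismatch.
        accepted = []
        for tok in draft:
            t = target_model(out + accepted)
            if t == tok:
                accepted.append(tok)
            else:
                accepted.append(t)  # the target's correction still yields one token
                break
        else:
            # All k drafts accepted; the target contributes one bonus token.
            accepted.append(target_model(out + accepted))
        out.extend(accepted)
    return out[len(prompt):][:num_tokens]
```

The methods in the table differ mainly in how step 1 is realized (a trained draft head, retrieval from the prompt, a suffix automaton, ...) and whether the candidates form a linear chain or a tree.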


Overview of speculative decoding algorithms evaluated in the MMSpec framework.

| Method | Key Idea | Drafting | Vision Awareness | Category |
| --- | --- | --- | --- | --- |
| ViSpec | Vision-token compression for efficient multimodal drafting | Linear | Vision-aware | Training-based |
| MSD | Trains a multimodal draft model with staged VLM training | Linear | Vision-aware | Training-based |
| EAGLE-1 / 2 / 3 | Feature-level or token-level drafting from target hidden states | Linear / Tree | Vision-agnostic | Training-based |
| Medusa | Multi-head tree proposals from a single forward pass | Tree | Vision-agnostic | Training-based |
| SAM Decoding | Suffix-automaton continuation retrieval for draft generation | Linear | Vision-agnostic | Training-free |
| Lookahead | Trie-based retrieval of multi-token continuations with tree verification | Tree | Vision-agnostic | Training-free |
| Recycling | Reuses discarded candidates as speculative draft tree nodes | Tree | Vision-agnostic | Training-free |
| PLD | Prompt n-gram lookup replaces an external draft model | Linear | Vision-agnostic | Training-free |

Experimental Findings

Finding 1

Training-free methods provide very limited gains and can even slow down multimodal decoding.

Finding 2

Training-based methods that ignore visual information still underperform in VLM inference.

Finding 3

Throughput speedup alone is not enough. Stable latency behavior matters for real deployments.

Overall Comparison

The main evaluations are conducted on Qwen2.5-VL-7B-Instruct and LLaVA-1.5-7B. MMSpec reports both Mean Accepted Tokens (MAT) and Walltime Speedup Ratio, highlighting that speculative decoding should be judged from both token acceptance and end-to-end latency efficiency.
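The two headline metrics can be made concrete with a small sketch. The function names and the toy per-step log below are hypothetical, not part of MMSpec's code; MAT is taken here as the average number of tokens committed per target-model verification pass, and speedup as the ratio of baseline to speculative wall-clock time on the same inputs.

```python
def mean_accepted_tokens(accepted_per_step):
    """MAT: average tokens committed per target-model forward pass.
    Higher MAT means fewer expensive verification calls per output token."""
    return sum(accepted_per_step) / len(accepted_per_step)

def walltime_speedup(baseline_seconds, spec_seconds):
    """End-to-end speedup of speculative decoding over plain
    autoregressive decoding on identical inputs and hardware."""
    return baseline_seconds / spec_seconds

# Toy log: tokens accepted at each verification step of one generation.
steps = [3, 1, 4, 2, 5]
mat = mean_accepted_tokens(steps)        # 3.0
speedup = walltime_speedup(12.9, 5.0)    # ≈ 2.58×
```

A high MAT does not guarantee a high speedup: if drafting itself is expensive, or acceptance varies wildly across samples, end-to-end latency can still be poor, which is exactly why MMSpec reports both.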

Qwen2.5-VL-7B

2.58×

Best overall speedup in the main table, achieved by MSD.

LLaVA-1.5-7B

2.58×

Best overall speedup in the main table, achieved by ViSpec.

Benchmark Scale

10

Representative speculative decoding algorithms evaluated under one framework.

Batch Size Sensitivity


Vision-aware methods remain the most robust as batch size increases, while non-vision-aware methods degrade more sharply.

Latency Analysis


Latency CDFs show that methods with higher average throughput do not always yield the best or most stable wall-clock behavior.

ViSkip

A Plug-and-Play Vision-Aware Method

ViSkip dynamically alternates between standard autoregressive decoding and speculative drafting according to the visual relevance of the current decoding state. At each decoding step, it computes cross-attention between the decoder hidden state and the visual tokens, derives a visual-relevance score, and enables speculation only when that score falls below a threshold.

In short: when the next token strongly depends on the image, skip speculative drafting; otherwise, draft aggressively and verify with the full model.
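The gating rule above can be sketched in a few lines. This is a minimal illustration, not ViSkip's actual implementation: the function names, the dot-product attention scoring, and the threshold `tau=0.5` are all assumptions; here the relevance score is simply the softmax attention mass that the current hidden state places on image-token positions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()           # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum()

def visual_relevance(hidden, context, is_visual):
    """Attention mass the current decoder state places on visual tokens:
    score each context token by dot product with `hidden` (shape (d,)),
    softmax over all positions, then sum the image-token weights."""
    weights = softmax(context @ hidden)   # context: (n_ctx, d)
    return float(weights[is_visual].sum())

def viskip_step(hidden, context, is_visual, tau=0.5):
    """ViSkip-style gate (illustrative): draft speculatively only when
    the next token does not depend strongly on the image."""
    r = visual_relevance(hidden, context, is_visual)
    return "autoregressive" if r >= tau else "speculate"
```

Because the gate only decides *whether* to draft at a given step, it composes with any of the benchmarked drafters, which is what makes ViSkip plug-and-play.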


Better Scaling

ViSkip improves speedup trends across batch sizes when combined with existing methods.


Lower Latency

Latency CDFs shift left after integrating ViSkip, indicating faster completion for most samples.


Target-Model Savings

Latency breakdowns show that ViSkip primarily reduces expensive full-model verification time.

The central takeaway of ViSkip is simple: speculative decoding is least reliable exactly when generation is most grounded in visual evidence. By detecting these steps and skipping drafting only when needed, ViSkip improves existing methods without changing their core speculative decoding mechanics.

BibTeX

@article{shen2025mmspec,
  title={MMSpec: Benchmarking Speculative Decoding for Vision-Language Models},
  author={Hui Shen and Xin Wang and Ping Zhang and Yunta Hsieh and Qi Han and Zhongwei Wan and Ziheng Zhang and Jingxuan Zhang and Jing Xiong and Ziyuan Liu and Yifan Zhang and Hangrui Cao and Chenyang Zhao and Mi Zhang},
  year={2025},
  note={Preprint}
}