ENCODE Lab @ CVPR 2026

01

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal LLMs

CVPR 2026 · Token Compression

SAT · Jun 6 · 11:45 AM–1:45 PM | ExHall F · Poster #320

A training-free, “listen-to-prune” framework for omnimodal LLMs: salient audio tokens guide the pruning of video tokens, so audio and video are compressed jointly. Up to ~3.4× faster inference and ~1.4× lower memory with near-baseline accuracy and no retraining.

Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang

▸ArXiv ▸Code

02

EarlyTom: Early Token Compression Completes Fast Video Understanding

CVPR 2026 · Token Compression

SUN · Jun 7 · 3:30–5:30 PM | ExHall A · Poster #405

Vision encoding itself dominates time-to-first-token (TTFT), so EarlyTom applies training-free token compression early, inside the vision encoder, rather than only late in prefilling. With a decoupled spatial token-selection strategy, it cuts TTFT by up to 2.65× and FLOPs by up to 61% on LLaVA-OneVision-7B at near-full-token accuracy.

Hesong Wang*, Xin Jin*, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang (*equal contribution)

▸ArXiv ▸Code ▸Webpage

03

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

CVPR 2026 · Streaming Video

SAT · Jun 6 · 4:45–6:45 PM | ExHall A & F · Poster #302

A training-free, two-stage framework for streaming video. Causal Temporal Reduction caps per-frame tokens before the LLM; Online Quantized Memory keeps the KV-cache in 4-bit with on-demand retrieval. ~15.7× KV-cache compression and ~2× faster time-to-first-token vs. prior SOTA, with memory bounded regardless of stream length.

Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

▸ArXiv ▸Code ▸Webpage
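
For readers unfamiliar with 4-bit storage, here is a minimal sketch of generic min/max 4-bit quantization with two codes packed per byte. It is not StreamingTOM's Online Quantized Memory: the function names `quantize_4bit` and `dequantize_4bit`, the per-vector scale and offset, and the packing layout are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Map floats to 16 levels (codes 0..15) and pack two codes per byte.

    Hypothetical helper for illustration; not taken from the paper.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up into bytes
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_4bit(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack the 4-bit codes and map them back to approximate floats."""
    codes: List[int] = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return [lo + c * scale for c in codes[:n]]


if __name__ == "__main__":
    key_vector = [0.12, -1.5, 0.98, 2.25, -0.4, 0.0, 1.1]
    packed, scale, lo = quantize_4bit(key_vector)
    restored = dequantize_4bit(packed, scale, lo, len(key_vector))
    print(len(packed), "bytes for", len(key_vector), "values")
    print(max(abs(a - b) for a, b in zip(key_vector, restored)))
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`). Storing 4 bits instead of 16 saves at most 4× on the values themselves, before counting the scale and offset kept alongside each vector.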

04

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

CVPR 2026 · Quantization · 3D

FRI · Jun 5 · 4:00–6:00 PM | ExHall A & F · Poster #34

The first post-training quantization framework for VGGT-style feed-forward 3D reconstruction. Per-block sensitivity drives mixed-precision allocation; token filtering with camera-information compensation tames outlier camera/register tokens; a task-aware scale search preserves cross-head geometric consistency. Near-lossless W4A16 with 3–4.9× memory reduction and up to 2.8× speedup — bringing 3D perception to edge devices.

Zhizhen Pan, Hesong Wang, Huan Wang

▸Code ▸Webpage

05

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

CVPR 2026 · Generative · Fast Inference

FRI · Jun 5 · 4:00–6:00 PM | ExHall A & F · Poster #171

Autoregressive image models are slow because they emit tokens one at a time. PJD reframes decoding as a Jacobi iteration that refines many tokens in parallel, cutting the number of sequential forward passes needed per image and speeding up generation.

Boya Liao, Ying Li, Siyong Jian, Huan Wang

▸Webpage
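
As background, the sketch below shows textbook greedy Jacobi decoding on a toy deterministic model. It is not the PJD algorithm, whose details may differ: `next_token` is a made-up stand-in for a real autoregressive model, and the all-zeros initial guess is an arbitrary assumption.

```python
from typing import List, Tuple


def next_token(prefix: List[int]) -> int:
    """Toy deterministic 'model' (hypothetical): the next token depends on the whole prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 16


def sequential_decode(prompt: List[int], n: int) -> List[int]:
    """Standard greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Refine all n positions each sweep until the guess stops changing."""
    guess = [0] * n  # arbitrary initial guess
    sweeps = 0
    while True:
        sweeps += 1
        # Every position is updated from the previous sweep's guess. With a real
        # model these n updates would be computed in one batched forward pass.
        updated = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if updated == guess:  # fixed point reached
            return guess, sweeps
        guess = updated


if __name__ == "__main__":
    tokens, sweeps = jacobi_decode([1, 2, 3], 8)
    print(tokens == sequential_decode([1, 2, 3], 8), sweeps)
```

The fixed point of the sweep is exactly the sequential greedy output, and because sweep k fixes at least the first k tokens, the loop stops within n + 1 sweeps. Any speedup comes from models where several tokens settle in the same sweep, which this toy model is not designed to show.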

06

ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps

CVPR 2026 · Visual Reasoning Benchmark

SUN · Jun 7 · 3:30–5:30 PM | ExHall A · Poster #454

A benchmark of 1,008 human-verified QA pairs over high-resolution transit maps from 30 cities in 13 countries, jointly probing fine-grained visual understanding and spatial reasoning. A two-level metric scores answer correctness and route quality; evaluating ~16 MLLMs surfaces a counterintuitive open- vs. closed-source reasoning-model gap.

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang

▸ArXiv ▸Code ▸Webpage

Thin air, dense ideas.
See you at altitude.

Meet our encoders at CVPR’26.