ECCV 2026

SVCBench

A Streaming Video Counting Benchmark for Spatial-Temporal State Maintenance

Counting serves as a controlled probe for world-state maintenance: models update object and event counts along the video timeline, so evaluation can localize exactly when and where state tracking breaks.

406 videos
1,000 streaming QA pairs
4,576 timeline query points
10,071 annotated state changes
8 fine-grained subcategories

Pengyiang Liu (1,2), Zhongyue Shi (1), Hongye Hao (1), Qi Fu (1), Xueting Bi (1), Siwei Zhang (1), Xiaoyang Hu (1), Zitian Wang (3,†), Linjiang Huang (1), Si Liu (1)

  1. Institute of Artificial Intelligence, Beihang University
  2. Meituan
  3. Hangzhou International Innovation Institute, Beihang University

† Corresponding author

Colalab Meituan

The problem

Final-answer scoring hides where a model loses the thread.

Long-video understanding is not only about recognizing what happened; a model must continuously maintain internal state as a scene evolves. SVCBench queries that state at many points along the timeline, turning a single graded number into a trajectory that shows when updates are missed and when counts drift.

Final-answer evaluation (top) scores only the last answer and cannot localize the error. SVCBench process-level streaming queries (bottom) pinpoint when and where state maintenance fails.
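The contrast between the two evaluation styles can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the function names and the example trajectory are hypothetical, and exact-match comparison is used only to keep the idea visible.

```python
from typing import List, Optional


def final_answer_correct(pred: List[int], truth: List[int]) -> bool:
    """Final-answer scoring: only the last query point is compared."""
    return pred[-1] == truth[-1]


def first_divergence(pred: List[int], truth: List[int]) -> Optional[int]:
    """Process-level view: index of the first query point where the
    predicted count differs from ground truth, or None if it never does."""
    for i, (p, t) in enumerate(zip(pred, truth)):
        if p != t:
            return i
    return None


# Hypothetical counts at five query points (illustrative numbers only).
truth = [1, 2, 2, 3, 4]
pred = [1, 2, 3, 3, 4]  # spurious increment at index 2, masked by the end

print(final_answer_correct(pred, truth))  # True
print(first_divergence(pred, truth))      # 2
```

Here the last answer is correct, so final-answer scoring reports no problem, while the per-query view shows the trajectory already went wrong at the third query point.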

Two families, eight subcategories.

We decompose state maintenance into object counting and event counting, then split each by what must be tracked and how it updates over time. The result is a diagnosable taxonomy rather than one aggregate score.

Object counting

O1

Current-state tracking

O1-Snap counts objects visible at a moment; O1-Delta counts the change across a time window. The count rises and falls in real time.
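As a rough sketch of how O1-style answers could be derived from state-change annotations, assume each annotation is a hypothetical `(timestamp, change)` pair, with `+1` when an object becomes visible and `-1` when it leaves. The annotation format and the reading of O1-Delta as a signed net change are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical annotation format: (timestamp_seconds, change), where change
# is +1 when an object becomes visible and -1 when it leaves the view.
StateChange = Tuple[float, int]


def snap_count(changes: List[StateChange], t: float) -> int:
    """O1-Snap style answer: number of objects visible at time t."""
    return sum(delta for ts, delta in changes if ts <= t)


def delta_count(changes: List[StateChange], t_start: float, t_end: float) -> int:
    """O1-Delta style answer: net change in visible objects over
    the window (t_start, t_end]. Assumed here to be signed."""
    return snap_count(changes, t_end) - snap_count(changes, t_start)


changes = [(1.0, +1), (2.5, +1), (4.0, -1), (6.0, +1), (7.5, -1), (8.0, -1)]

print([snap_count(changes, t) for t in (0.0, 3.0, 5.0, 7.0, 9.0)])  # [0, 2, 1, 2, 0]
print(delta_count(changes, 3.0, 5.0))  # -1
```

The trajectory `[0, 2, 1, 2, 0]` both rises and falls, which is the defining property of current-state tracking.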

O2

Cumulative identities

O2-Unique counts distinct individuals seen so far; O2-Gain counts newcomers in a window. The trajectory should never decrease.
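A comparable sketch for cumulative identities, assuming hypothetical `(timestamp, identity_label)` sightings. Because the set of identities seen so far can only grow, the O2-Unique trajectory is non-decreasing by construction.

```python
from typing import List, Tuple

# Hypothetical annotation format: (timestamp_seconds, identity_label) for each
# sighting; the same individual may appear many times.
Sighting = Tuple[float, str]


def unique_count(sightings: List[Sighting], t: float) -> int:
    """O2-Unique style answer: distinct individuals seen up to time t."""
    return len({ident for ts, ident in sightings if ts <= t})


def gain_count(sightings: List[Sighting], t_start: float, t_end: float) -> int:
    """O2-Gain style answer: individuals first seen in (t_start, t_end]."""
    seen_before = {ident for ts, ident in sightings if ts <= t_start}
    seen_by_end = {ident for ts, ident in sightings if ts <= t_end}
    return len(seen_by_end - seen_before)


sightings = [(1.0, "a"), (2.0, "b"), (3.5, "a"), (5.0, "c"), (6.0, "b"), (8.0, "d")]

print([unique_count(sightings, t) for t in (0.0, 2.0, 4.0, 6.0, 9.0)])  # [0, 2, 2, 3, 4]
print(gain_count(sightings, 4.0, 9.0))  # 2
```

Re-sightings of "a" and "b" do not raise the count, which is exactly the behavior a model must reproduce to avoid treating a returning individual as a newcomer.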

Event counting

E1

Instantaneous events

E1-Action counts brief atomic actions; E1-Transit counts hard scene transitions. Each occurrence is a discrete moment to detect.

E2

Activity cycles

E2-Periodic counts complete repetitive cycles; E2-Episode counts semantically complete segments. Both demand boundary detection over time.

Taxonomy of spatial-temporal state maintenance across object and event dimensions with example questions and answers for the eight subcategories.
Each subcategory tracks a different quantity: visible counts (O1), cumulative identities (O2), discrete occurrences (E1), and complete activity cycles (E2).

Built timeline-first, annotated frame by frame.

Videos are drawn from web platforms, existing vision datasets, and physics simulations, then annotated for every event occurrence and object state change. Query points are placed at stable frames across each timeline.

Construction pipeline: source selection, frame-level annotation of events and state changes, and streaming query-point design.
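The page does not define what makes a frame "stable". One plausible reading, used here purely as an assumption, is that a query point should sit at least some margin away from every annotated state change, so the ground-truth count is unambiguous at the moment of the query. The grid step, the margin, and the function name below are all hypothetical.

```python
from typing import List


def stable_query_points(
    change_times: List[float],
    duration: float,
    step: float = 1.0,    # hypothetical spacing of candidate query times
    margin: float = 0.5,  # hypothetical minimum distance from any state change
) -> List[float]:
    """Keep candidate times on a regular grid that are at least `margin`
    seconds away from every annotated state change."""
    points = []
    n_steps = int(duration // step)
    for k in range(1, n_steps + 1):
        t = k * step
        if all(abs(t - c) >= margin for c in change_times):
            points.append(t)
    return points


print(stable_query_points([1.2, 3.0, 3.4, 6.9], duration=8.0))
# [2.0, 4.0, 5.0, 6.0, 8.0]
```

Candidates at 1 s, 3 s, and 7 s are dropped because a state change falls within half a second of each.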
Dataset statistics by subcategory. Avg. density is the mean number of queried targets per QA pair.
Category        | Questions | Videos | Annotated events | Avg. density
Object counting |       546 |    196 |            2,483 |          4.5
O1-Snap         |        81 |     78 |                - |            -
O1-Delta        |        98 |     75 |                - |            -
O2-Unique       |       289 |    171 |            2,483 |          8.6
O2-Gain         |        78 |     64 |                - |            -
Event counting  |       454 |    281 |            7,588 |         16.7
E1-Action       |       203 |    146 |            1,893 |          9.3
E1-Transit      |        48 |     29 |              682 |         14.2
E2-Episode      |       147 |    100 |            1,033 |          7.0
E2-Periodic     |        56 |     56 |            3,980 |         71.1
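As a sanity check on the table, the reported Avg. density values appear consistent with annotated events divided by questions in each row. This relationship is inferred from the numbers rather than stated on the page, so the sketch below treats it as an assumption and simply tests it against the published figures.

```python
# (questions, annotated events, reported avg. density), copied from the table.
ROWS = {
    "Object counting": (546, 2483, 4.5),
    "O2-Unique": (289, 2483, 8.6),
    "Event counting": (454, 7588, 16.7),
    "E1-Action": (203, 1893, 9.3),
    "E1-Transit": (48, 682, 14.2),
    "E2-Episode": (147, 1033, 7.0),
    "E2-Periodic": (56, 3980, 71.1),
}


def density(questions: int, events: int) -> float:
    """Assumed definition: annotated events per QA pair."""
    return events / questions


def matches_reported(questions: int, events: int, reported: float,
                     tol: float = 0.05) -> bool:
    """True if the ratio agrees with the reported value to one decimal place."""
    return abs(density(questions, events) - reported) <= tol


for name, (q, e, reported) in ROWS.items():
    print(f"{name:16s} {density(q, e):6.2f} (reported {reported})")
```

The two family totals also add up to the headline figures: 546 + 454 = 1,000 QA pairs and 2,483 + 7,588 = 10,071 annotated state changes.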

Three metrics for one trajectory.

Streaming queries produce a prediction trajectory, not an isolated answer. Three complementary metrics read different things from it, and together they separate distinct failure modes.

GPA

Gaussian Precision Accuracy

Scores how close each predicted count is to ground truth, with a soft Gaussian tolerance that rewards near-misses and penalizes large outliers.
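The exact GPA formula is not given on this page. A minimal sketch of a Gaussian-tolerance score, assuming a plain Gaussian kernel on the count error with a hypothetical fixed width `sigma`, could look like this; the benchmark's actual tolerance and normalization may differ.

```python
import math
from typing import Sequence


def gaussian_score(pred: float, truth: float, sigma: float = 1.0) -> float:
    """Soft per-query score in (0, 1]: 1.0 for an exact match, decaying
    smoothly as the count error grows. `sigma` is a hypothetical tolerance."""
    return math.exp(-((pred - truth) ** 2) / (2.0 * sigma ** 2))


def gpa_sketch(preds: Sequence[float], truths: Sequence[float],
               sigma: float = 1.0) -> float:
    """Mean per-query Gaussian score over a trajectory, scaled to 0-100."""
    scores = [gaussian_score(p, t, sigma) for p, t in zip(preds, truths)]
    return 100.0 * sum(scores) / len(scores)


print(round(gaussian_score(5, 5), 3))   # 1.0
print(round(gaussian_score(6, 5), 3))   # 0.607
print(gaussian_score(10, 5) < 1e-4)     # True
```

With this kernel an off-by-one answer keeps about 61% of the credit, while an error of five counts earns essentially nothing, which matches the stated intent of rewarding near-misses and penalizing large outliers.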

MoC

Monotonicity Consistency

Checks whether trajectories for cumulative quantities stay non-decreasing, flagging the first point where the prediction regresses.
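The monotonicity check itself is simple to sketch. How the benchmark aggregates it into a single MoC number is not specified here, so the fraction-of-trajectories aggregation below is an assumption, as are the function names.

```python
from typing import List, Optional, Sequence


def first_regression(trajectory: Sequence[int]) -> Optional[int]:
    """Index of the first query point whose predicted count is lower than
    the previous one, or None if the trajectory never decreases."""
    for i in range(1, len(trajectory)):
        if trajectory[i] < trajectory[i - 1]:
            return i
    return None


def moc_sketch(trajectories: List[Sequence[int]]) -> float:
    """Assumed aggregation: share of trajectories with no regression, 0-100."""
    ok = sum(1 for traj in trajectories if first_regression(traj) is None)
    return 100.0 * ok / len(trajectories)


print(first_regression([0, 1, 1, 3, 2, 4]))  # 4
print(first_regression([0, 1, 1, 3]))        # None
print(moc_sketch([[0, 1, 2], [0, 2, 1], [1, 1, 1], [3, 2, 2]]))  # 50.0
```

This check needs no ground truth: a cumulative count that drops is wrong regardless of the true value, so it isolates a consistency failure from an accuracy failure.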

UDA

Update Detection Accuracy

Measures whether the prediction actually changes at the moments the ground-truth state changes, capturing update timing.
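One way to read this description, offered as an assumption rather than the benchmark's definition, is as a recall over ground-truth update steps: for every pair of consecutive query points where the true count changes, check whether the predicted count changes too. The actual metric may also account for spurious updates or the direction of change.

```python
from typing import Optional, Sequence


def uda_sketch(pred: Sequence[int], truth: Sequence[int]) -> Optional[float]:
    """Fraction of ground-truth update steps at which the prediction also
    changes. Returns None when the ground truth never changes."""
    hits = 0
    total = 0
    for i in range(1, len(truth)):
        if truth[i] != truth[i - 1]:      # the true state updated here
            total += 1
            if pred[i] != pred[i - 1]:    # the model updated at the same step
                hits += 1
    return hits / total if total else None


truth = [0, 1, 1, 2, 3]
pred = [0, 0, 1, 2, 2]  # first update arrives one step late, last one is missed

print(uda_sketch(pred, truth))  # 0.3333333333333333
```

In this example the prediction is never off by more than one, so a tolerance-based accuracy score would look reasonable, yet only one of the three true updates is detected on time.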

Models stay far below human state maintenance.

Overall GPA on SVCBench, scaled to the human ceiling of 96.1. Even strong models look temporally smooth while missing the precise updates that counting demands.


The best model reaches 37.0 GPA against a human score of 96.1, a gap of 59.1 points. All models score below 11 on periodic event counting (E2-Periodic), where humans reach 93.2.

Where the failure actually lives.

Two controlled experiments isolate the bottleneck: the issue is grounding visual change in time, not the arithmetic of cumulative counting.

Reasoning is not the bottleneck. Overlaying an explicit cycle count lifts E2-Periodic GPA from 3.9 to 81.8, near the human level of 93.2. Models can maintain a running total; they fail to read cycle boundaries from pixels.
Identity persistence decays over time. With one bench repeatedly entering and leaving a rotating view, model GPA falls from roughly 70 to below 36 by 60 cycles, while a human holds at 100. Models mistake the same bench for a new object each time it re-enters the view.

Citation

svcbench.bib
@inproceedings{liu2026svcbench,
  title     = {SVCBench: A Streaming Video Counting Benchmark for
               Spatial-Temporal State Maintenance},
  author    = {Liu, Pengyiang and Shi, Zhongyue and Hao, Hongye and Fu, Qi and
               Bi, Xueting and Zhang, Siwei and Hu, Xiaoyang and Wang, Zitian and
               Huang, Linjiang and Liu, Si},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}