EventBench: Towards Comprehensive
Benchmarking of Event-based MLLMs

1 Xidian University, 2 Tsinghua University
Paper · Code · Datasets · Model

The comprehensive EventBench covers 8 diverse task metrics for systematically evaluating the capabilities of event-based MLLMs. These metrics fall into three broad categories: understanding (detailed understanding and causal reasoning), recognition (action recognition, gesture recognition, and event OCR), and spatial reasoning (spatial relationship, absolute distance, and object counting).

Abstract

Multimodal large language models (MLLMs) have made significant advances in event-based vision. However, comprehensively evaluating these models' capabilities within a unified benchmark remains a crucial yet largely unexplored problem. In this paper, we introduce EventBench, an evaluation benchmark covering 8 diverse task metrics together with a large-scale event stream dataset. Our work differs from existing event-based benchmarks in four key aspects: i) Openness in accessibility, releasing all raw event streams and task instructions across the 8 evaluation metrics; ii) Diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks to enable a comprehensive evaluation of model capabilities; iii) Integration of spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; iv) Scale in data volume, with an accompanying training set of over one million event–text pairs supporting large-scale training and evaluation. With EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, as well as event-based MLLMs like EventGPT that directly process raw event inputs. Extensive evaluation reveals that current event-based MLLMs perform well in event stream understanding, yet still struggle with fine-grained recognition and spatial reasoning.
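For illustration, the sketch below outlines how a model could be scored on EventBench-style samples, with per-task accuracy averaged over the 8 metrics. The sample fields, the model.answer interface, and exact-match scoring are hypothetical stand-ins chosen for this sketch; the released code and datasets define the actual file formats, prompts, and metrics.

from collections import defaultdict

# Eight task metrics: understanding (DU, CR), recognition (AR, GR, E-OCR),
# and spatial reasoning (SR, AD, OC).
TASKS = ["DU", "CR", "AR", "GR", "E-OCR", "SR", "AD", "OC"]

def evaluate(model, samples):
    """Score a model on (event stream, instruction, reference answer) samples.

    The sample fields, the model.answer() interface, and exact-match scoring
    are hypothetical stand-ins, not the benchmark's actual API or metric.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        # Feed the raw event stream plus the task instruction to the model.
        pred = model.answer(s["event_stream"], s["instruction"])
        total[s["task"]] += 1
        correct[s["task"]] += int(pred.strip().lower() == s["answer"].strip().lower())
    # Per-task accuracy (percent), reported only for tasks that have samples.
    return {t: 100.0 * correct[t] / total[t] for t in TASKS if total[t]}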

Main Results on EventBench

[Figure: radar chart of model performance across the eight task metrics]
Abbreviations: DU = detailed understanding, CR = causal reasoning (Understanding); AR = action recognition, GR = gesture recognition, E-OCR = event OCR (Recognition); SR = spatial relationship, AD = absolute distance, OC = object counting (Spatial Reasoning).

| Model | Params | Backbone | DU | CR | AR | GR | E-OCR | SR | AD | OC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ● Closed-Source MLLMs | | | | | | | | | | | |
| GPT-5 | - | - | 68.5 | 70.1 | 46.0 | 49.8 | 22.2 | 47.8 | 30.7 | 25.6 | 45.1 |
| GPT-4o | - | - | 62.5 | 65.8 | 55.3 | 48.6 | 19.1 | 45.5 | 17.8 | 21.5 | 42.0 |
| Gemini-2.5 Pro | - | - | 61.3 | 63.9 | 44.1 | 49.9 | 19.1 | 40.5 | 17.5 | 16.9 | 39.2 |
| Gemini-2.5 Flash | - | - | 54.7 | 61.1 | 42.7 | 43.4 | 21.0 | 25.6 | 15.6 | 14.5 | 34.8 |
| Doubao-Seed-1.6 Vision | - | - | 70.3 | 65.2 | 46.6 | 45.4 | 45.1 | 11.6 | 36.3 | 23.0 | 42.9 |
| ● Open-Source MLLMs (SFT) | | | | | | | | | | | |
| Qwen2.5-VL | 3B | Qwen2.5 | 59.8 | 65.6 | 52.9 | 35.2 | 45.6 | 44.9 | 31.9 | 59.9 | 49.5 |
| InternVL3.5 | 4B | Qwen3 | 73.1 | 75.0 | 69.3 | 45.2 | 38.9 | 40.5 | 46.8 | 61.1 | 56.2 |
| MiniCPM-V-4 | 4.1B | MiniCPM4 | 72.9 | 63.9 | 58.9 | 37.1 | 44.4 | 44.2 | 43.1 | 55.6 | 52.5 |
| MiMo-VL | 7B | MiMo | 71.8 | 68.3 | 57.3 | 38.6 | 31.5 | 44.5 | 45.2 | 61.0 | 52.3 |
| ● Open-Source Event-Based MLLMs | | | | | | | | | | | |
| EventGPT | 7B | Vicuna-1.5 | 76.8 | 79.2 | 78.6 | 57.2 | 52.4 | 49.8 | 52.6 | 61.5 | 63.5 |
| EventGPT+(B) | 2B | Qwen2.5 | 78.3 | 78.2 | 79.8 | 55.5 | 53.6 | 50.2 | 51.9 | 62.8 | 63.8 |
| EventGPT+(L) | 7B | Qwen2.5 | 79.1 | 79.6 | 81.6 | 58.7 | 55.4 | 53.2 | 52.2 | 64.6 | 65.5 |
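For reference, the short sketch below aggregates a row of eight per-task scores into the three category averages and the Overall column. Treating Overall as the unweighted mean of the eight task scores is inferred from the numbers above (e.g., GPT-5's row averages to roughly 45.1), not something stated by the authors, so the official weighting may differ.

# Task groups as described in the benchmark overview above.
GROUPS = {
    "Understanding": ["DU", "CR"],
    "Recognition": ["AR", "GR", "E-OCR"],
    "Spatial Reasoning": ["SR", "AD", "OC"],
}

def aggregate(scores):
    """scores: dict mapping each of the eight task abbreviations to a score in [0, 100]."""
    per_group = {g: sum(scores[t] for t in ts) / len(ts) for g, ts in GROUPS.items()}
    overall = sum(scores.values()) / len(scores)  # unweighted mean over all eight tasks
    return per_group, overall

# GPT-5 row from the table above; the computed overall (~45.1) matches its Overall column.
gpt5 = {"DU": 68.5, "CR": 70.1, "AR": 46.0, "GR": 49.8,
        "E-OCR": 22.2, "SR": 47.8, "AD": 30.7, "OC": 25.6}
print(aggregate(gpt5))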

Dataset Statistics


Data statistics of our EventBench. (a) Task and category distribution across three groups: understanding (i.e., DU and CR), recognition (i.e., AR, GR, and E-OCR), and spatial reasoning (i.e., SR, AD, and OC). (b) Sample statistics for each task category. (c) Comparison with existing event-based benchmarks across multiple dimensions (i.e., modality, metric type, data source, size, and temporal span). Note that EventBench provides a comprehensive benchmark for systematically evaluating the capabilities of event-based MLLMs.

BibTeX

@article{liu2025eventbench,
  title={EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs},
  author={Liu, Shaoyu and Li, Jianing and Zhao, Guanghui and Zhang, Yunjian and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2511.18448},
  year={2025}
}