EventBench covers 8 diverse task metrics for systematically evaluating the capabilities of event-based MLLMs. These metrics fall into three groups: understanding (detailed understanding and causal reasoning), recognition (action recognition, gesture recognition, and event OCR), and spatial reasoning (spatial relationship, absolute distance, and object counting).
Abstract
Multimodal large language models (MLLMs) have made significant advances in event-based vision. However, comprehensively evaluating these models' capabilities within a unified benchmark remains a crucial yet largely unexplored problem. In this paper, we introduce EventBench, an evaluation benchmark covering 8 diverse task metrics together with a large-scale event stream dataset. Our work differs from existing event-based benchmarks in four key aspects: i) Openness in accessibility, releasing all raw event streams and task instructions across the 8 evaluation metrics; ii) Diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks to enable comprehensive evaluation of model capabilities; iii) Integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; iv) Scale in data volume, with an accompanying training set of over one million event–text pairs that supports large-scale training and evaluation. With EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event inputs. Extensive evaluation reveals that current event-based MLLMs perform well at event stream understanding but still struggle with fine-grained recognition and spatial reasoning.
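For concreteness, the eight task metrics and their three-way grouping can be written as a small configuration mapping. This is an illustrative sketch only; the identifiers below are hypothetical and not part of the released benchmark's API:

```python
# Hypothetical grouping of EventBench's 8 task metrics into its 3 categories.
# Identifiers are illustrative, not the benchmark's actual API.
EVENTBENCH_TASKS = {
    "understanding": ["detailed_understanding", "causal_reasoning"],      # DU, CR
    "recognition": ["action_recognition", "gesture_recognition",
                    "event_ocr"],                                         # AR, GR, E-OCR
    "spatial_reasoning": ["spatial_relationship", "absolute_distance",
                          "object_counting"],                             # SR, AD, OC
}

# The three groups together cover all 8 evaluation metrics.
assert sum(len(tasks) for tasks in EVENTBENCH_TASKS.values()) == 8
```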
Main Results on EventBench
Understanding: DU (detailed understanding), CR (causal reasoning). Recognition: AR (action recognition), GR (gesture recognition), E-OCR (event OCR). Spatial reasoning: SR (spatial relationship), AD (absolute distance), OC (object counting). Overall is the unweighted mean of the eight task scores.

| Model | Params | Backbone | DU | CR | AR | GR | E-OCR | SR | AD | OC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ● Closed-Source MLLMs | | | | | | | | | | | |
| GPT-5 | - | - | 68.5 | 70.1 | 46.0 | 49.8 | 22.2 | 47.8 | 30.7 | 25.6 | 45.1 |
| GPT-4o | - | - | 62.5 | 65.8 | 55.3 | 48.6 | 19.1 | 45.5 | 17.8 | 21.5 | 42.0 |
| Gemini-2.5 Pro | - | - | 61.3 | 63.9 | 44.1 | 49.9 | 19.1 | 40.5 | 17.5 | 16.9 | 39.2 |
| Gemini-2.5 Flash | - | - | 54.7 | 61.1 | 42.7 | 43.4 | 21.0 | 25.6 | 15.6 | 14.5 | 34.8 |
| Doubao-Seed-1.6 Vision | - | - | 70.3 | 65.2 | 46.6 | 45.4 | 45.1 | 11.6 | 36.3 | 23.0 | 42.9 |
| ● Open-Source MLLMs (SFT) | | | | | | | | | | | |
| Qwen2.5-VL | 3B | Qwen2.5 | 59.8 | 65.6 | 52.9 | 35.2 | 45.6 | 44.9 | 31.9 | 59.9 | 49.5 |
| InternVL3.5 | 4B | Qwen3 | 73.1 | 75.0 | 69.3 | 45.2 | 38.9 | 40.5 | 46.8 | 61.1 | 56.2 |
| MiniCPM-V-4 | 4.1B | MiniCPM4 | 72.9 | 63.9 | 58.9 | 37.1 | 44.4 | 44.2 | 43.1 | 55.6 | 52.5 |
| MiMo-VL | 7B | MiMo | 71.8 | 68.3 | 57.3 | 38.6 | 31.5 | 44.5 | 45.2 | 61.0 | 52.3 |
| ● Open-Source Event-Based MLLMs | | | | | | | | | | | |
| EventGPT | 7B | Vicuna-1.5 | 76.8 | 79.2 | 78.6 | 57.2 | 52.4 | 49.8 | 52.6 | 61.5 | 63.5 |
| EventGPT+(B) | 2B | Qwen2.5 | 78.3 | 78.2 | 79.8 | 55.5 | 53.6 | 50.2 | 51.9 | 62.8 | 63.8 |
| EventGPT+(L) | 7B | Qwen2.5 | 79.1 | 79.6 | 81.6 | 58.7 | 55.4 | 53.2 | 52.2 | 64.6 | 65.5 |
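As noted above, the Overall column is the unweighted mean of the eight task scores; a quick sanity check against the GPT-5 row:

```python
# Sanity check: Overall equals the unweighted mean of the eight task scores.
# Scores taken from the GPT-5 row (DU, CR, AR, GR, E-OCR, SR, AD, OC).
gpt5_scores = [68.5, 70.1, 46.0, 49.8, 22.2, 47.8, 30.7, 25.6]
overall = sum(gpt5_scores) / len(gpt5_scores)
print(f"{overall:.1f}")  # 45.1, matching the table's Overall column
```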
Dataset Statistics
Data statistics of EventBench. (a) Task and category distribution across the three groups: understanding (DU and CR), recognition (AR, GR, and E-OCR), and spatial reasoning (SR, AD, and OC). (b) Sample statistics for each task category. (c) Comparison with existing event-based benchmarks across multiple dimensions (modality, metric type, data source, size, and temporal span). EventBench thus provides a comprehensive benchmark for systematically evaluating the capabilities of event-based MLLMs.
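Each training sample pairs a raw event stream with a text instruction and answer. Event-camera streams are commonly stored as (timestamp, x, y, polarity) tuples; the schema below is a hypothetical illustration of such a pair, not the dataset's actual on-disk format:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EventTextPair:
    """Hypothetical schema for one event-text training pair.

    Field names are illustrative, not the dataset's actual format.
    """
    events: np.ndarray  # shape (N, 4): timestamp (us), x, y, polarity (+1/-1)
    instruction: str    # task prompt, e.g. an object-counting question
    answer: str         # ground-truth response
    task: str           # one of the 8 task metrics, e.g. "object_counting"

# Example with two synthetic events for illustration:
pair = EventTextPair(
    events=np.array([[1000, 120, 64, 1], [1042, 121, 64, -1]]),
    instruction="How many people appear in the event stream?",
    answer="2",
    task="object_counting",
)
```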
BibTeX
```bibtex
@article{liu2025eventbench,
  title={EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs},
  author={Liu, Shaoyu and Li, Jianing and Zhao, Guanghui and Zhang, Yunjian and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2511.18448},
  year={2025}
}
```