EventBench: Towards Comprehensive
Benchmarking of Event-based MLLMs

1 Xidian University, 2 Tsinghua University
Paper · Code · Datasets · Model

The comprehensive EventBench covers 8 diverse task metrics for systematically evaluating the capabilities of event-based MLLMs. These metrics fall into three broad categories: understanding (detailed understanding and causal reasoning), recognition (action recognition, gesture recognition, and event OCR), and spatial reasoning (spatial relationship, absolute distance, and object counting).

Abstract

Multimodal large language models (MLLMs) have made significant advances in event-based vision. However, comprehensively evaluating these models' capabilities within a unified benchmark remains a crucial yet largely unexplored problem. In this paper, we introduce EventBench, an evaluation benchmark covering 8 diverse task metrics together with a large-scale event stream dataset. Our work differs from existing event-based benchmarks in four key aspects: i) Openness in accessibility, releasing all raw event streams and task instructions across the 8 evaluation metrics; ii) Diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks to enable a comprehensive evaluation of model capabilities; iii) Integration of spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; iv) Scale in data volume, with an accompanying training set of over one million event–text pairs supporting large-scale training and evaluation. With EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, as well as event-based MLLMs like EventGPT that directly process raw event inputs. Extensive evaluation reveals that current event-based MLLMs perform well in event stream understanding, yet still struggle with fine-grained recognition and spatial reasoning.
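For illustration, the sketch below outlines how a model could be scored on EventBench-style samples, with per-task accuracy averaged over the 8 metrics. The sample fields, the model.answer interface, and exact-match scoring are hypothetical stand-ins chosen for this sketch; the released code and datasets define the actual file formats, prompts, and metrics.

from collections import defaultdict

# Eight task metrics: understanding (DU, CR), recognition (AR, GR, E-OCR),
# and spatial reasoning (SR, AD, OC).
TASKS = ["DU", "CR", "AR", "GR", "E-OCR", "SR", "AD", "OC"]

def evaluate(model, samples):
    """Score a model on (event stream, instruction, reference answer) samples.

    The sample fields, the model.answer() interface, and exact-match scoring
    are hypothetical stand-ins, not the benchmark's actual API or metric.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        # Feed the raw event stream plus the task instruction to the model.
        pred = model.answer(s["event_stream"], s["instruction"])
        total[s["task"]] += 1
        correct[s["task"]] += int(pred.strip().lower() == s["answer"].strip().lower())
    # Per-task accuracy (percent), reported only for tasks that have samples.
    return {t: 100.0 * correct[t] / total[t] for t in TASKS if total[t]}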

Main Results on EventBench

[Figure: radar chart of model performance across the eight task metrics]
Abbreviations: DU = detailed understanding, CR = causal reasoning (Understanding); AR = action recognition, GR = gesture recognition, E-OCR = event OCR (Recognition); SR = spatial relationship, AD = absolute distance, OC = object counting (Spatial Reasoning).

| Model | Params | Backbone | DU | CR | AR | GR | E-OCR | SR | AD | OC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ● Closed-Source MLLMs | | | | | | | | | | | |
| GPT-5 | - | - | 68.5 | 70.1 | 46.0 | 49.8 | 22.2 | 47.8 | 30.7 | 25.6 | 45.1 |
| GPT-4o | - | - | 62.5 | 65.8 | 55.3 | 48.6 | 19.1 | 45.5 | 17.8 | 21.5 | 42.0 |
| Gemini-2.5 Pro | - | - | 61.3 | 63.9 | 44.1 | 49.9 | 19.1 | 40.5 | 17.5 | 16.9 | 39.2 |
| Gemini-2.5 Flash | - | - | 54.7 | 61.1 | 42.7 | 43.4 | 21.0 | 25.6 | 15.6 | 14.5 | 34.8 |
| Doubao-Seed-1.6 Vision | - | - | 70.3 | 65.2 | 46.6 | 45.4 | 45.1 | 11.6 | 36.3 | 23.0 | 42.9 |
| ● Open-Source MLLMs (SFT) | | | | | | | | | | | |
| Qwen2.5-VL | 3B | Qwen2.5 | 59.8 | 65.6 | 52.9 | 35.2 | 45.6 | 44.9 | 31.9 | 59.9 | 49.5 |
| InternVL3.5 | 4B | Qwen3 | 73.1 | 75.0 | 69.3 | 45.2 | 38.9 | 40.5 | 46.8 | 61.1 | 56.2 |
| MiniCPM-V-4 | 4.1B | MiniCPM4 | 72.9 | 63.9 | 58.9 | 37.1 | 44.4 | 44.2 | 43.1 | 55.6 | 52.5 |
| MiMo-VL | 7B | MiMo | 71.8 | 68.3 | 57.3 | 38.6 | 31.5 | 44.5 | 45.2 | 61.0 | 52.3 |
| ● Open-Source Event-Based MLLMs | | | | | | | | | | | |
| EventGPT | 7B | Vicuna-1.5 | 76.8 | 79.2 | 78.6 | 57.2 | 52.4 | 49.8 | 52.6 | 61.5 | 63.5 |
| EventGPT+(B) | 2B | Qwen2.5 | 78.3 | 78.2 | 79.8 | 55.5 | 53.6 | 50.2 | 51.9 | 62.8 | 63.8 |
| EventGPT+(L) | 7B | Qwen2.5 | 79.1 | 79.6 | 81.6 | 58.7 | 55.4 | 53.2 | 52.2 | 64.6 | 65.5 |
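For reference, the short sketch below aggregates a row of eight per-task scores into the three category averages and the Overall column. Treating Overall as the unweighted mean of the eight task scores is inferred from the numbers above (e.g., GPT-5's row averages to roughly 45.1), not something stated by the authors, so the official weighting may differ.

# Task groups as described in the benchmark overview above.
GROUPS = {
    "Understanding": ["DU", "CR"],
    "Recognition": ["AR", "GR", "E-OCR"],
    "Spatial Reasoning": ["SR", "AD", "OC"],
}

def aggregate(scores):
    """scores: dict mapping each of the eight task abbreviations to a score in [0, 100]."""
    per_group = {g: sum(scores[t] for t in ts) / len(ts) for g, ts in GROUPS.items()}
    overall = sum(scores.values()) / len(scores)  # unweighted mean over all eight tasks
    return per_group, overall

# GPT-5 row from the table above; the computed overall (~45.1) matches its Overall column.
gpt5 = {"DU": 68.5, "CR": 70.1, "AR": 46.0, "GR": 49.8,
        "E-OCR": 22.2, "SR": 47.8, "AD": 30.7, "OC": 25.6}
print(aggregate(gpt5))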

Dataset Statistics


Data statistics of our EventBench. (a) Task and category distribution across three groups: understanding (i.e., DU and CR), recognition (i.e., AR, GR, and E-OCR), and spatial reasoning (i.e., SR, AD, and OC). (b) Sample statistics for each task category. (c) Comparison with existing event-based benchmarks across multiple dimensions (i.e., modality, metric type, data source, size, and temporal span). Note that EventBench provides a comprehensive benchmark for systematically evaluating the capabilities of event-based MLLMs.

BibTeX

@article{liu2025eventbench,
  title={EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs},
  author={Liu, Shaoyu and Li, Jianing and Zhao, Guanghui and Zhang, Yunjian and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2511.18448},
  year={2025}
}