We introduce R2I-Bench, a comprehensive benchmark designed to assess the reasoning capabilities of text-to-image (T2I) generation models. It encompasses 7 primary reasoning categories, which are further subdivided into 32 fine-grained subcategories.
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation; for example, generating a bitten apple that has been left in the air for more than a week requires an understanding of temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises 3,068 meticulously curated data instances spanning 7 core reasoning categories: commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2I-Score, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality.
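To make the QA-style evaluation concrete, below is a minimal sketch of how an R2I-Score-like loop could be organized. The instance schema, the vqa_model.answer call, and the simple yes-rate aggregation are illustrative assumptions, not the authors' released implementation; the paper's actual question sets, judge model, and scoring rules are not reproduced here.

# Hedged sketch of a QA-style evaluation loop in the spirit of R2I-Score.
# `vqa_model.answer` and the instance schema are hypothetical stand-ins.

from dataclasses import dataclass, field

@dataclass
class EvalQuestion:
    text: str       # e.g., "Does the apple show week-old browning and wrinkling?"
    dimension: str  # one of: "alignment", "reasoning", "quality"

@dataclass
class BenchInstance:
    prompt: str    # the reasoning-oriented T2I prompt
    category: str  # e.g., "commonsense", "causal", "concept mixing"
    questions: list[EvalQuestion] = field(default_factory=list)

def r2i_style_score(image, instance: BenchInstance, vqa_model) -> dict[str, float]:
    """Answer each instance-specific question with a VQA judge and
    average the yes-rates per dimension (a simple aggregation choice;
    the paper may weight or combine the dimensions differently)."""
    per_dim: dict[str, list[float]] = {}
    for q in instance.questions:
        # Hypothetical call: assumes the judge returns a free-text answer.
        answer = vqa_model.answer(image=image, question=q.text)
        score = 1.0 if answer.strip().lower().startswith("yes") else 0.0
        per_dim.setdefault(q.dimension, []).append(score)
    return {dim: sum(v) / len(v) for dim, v in per_dim.items()}

Averaging binary answers per dimension keeps the three scores interpretable and comparable across instances; a weighted or rubric-based aggregation would be an equally plausible design.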
Figure: Key statistics of R2I-Bench.
Figure: Distribution of diverse reasoning categories in R2I-Bench (Caus.: Causal; Con. Mix.: Concept Mixing; Comp.: Compositional; Num.: Numerical; Comm.: Commonsense; Math.: Mathematical).
Figure: Example illustration of R2I-Bench and R2I-Score.
Figure: Data curation pipeline of R2I-Bench.
Figure: Failure cases of the pipeline-based framework on compositional, numerical, and mathematical reasoning.
Figure: Accuracy scores on R2I-Bench of the four top-performing models, one from each of four model architectures.
Figure: Error distribution on R2I-Bench of the four top-performing models, one from each of four model architectures.
Table: Detailed performance comparison of a standard T2I model versus a pipeline-based framework.
@misc{chen2025r2ibenchbenchmarkingreasoningdriventexttoimage,
  title={R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation},
  author={Kaijie Chen and Zihao Lin and Zhiyang Xu and Ying Shen and Yuguang Yao and Joy Rimchala and Jiaxin Zhang and Lifu Huang},
  year={2025},
  eprint={2505.23493},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23493},
}