We introduce R2I-Bench, a comprehensive benchmark designed to assess the reasoning capabilities of text-to-image (T2I) generation models. It encompasses 7 primary reasoning categories, which are further subdivided into 32 fine-grained subcategories.
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation; for example, generating a bitten apple that has been left in the air for more than a week requires an understanding of temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises 3,068 meticulously curated data instances spanning 7 core reasoning categories: commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2I-Score, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality.
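To make the QA-style evaluation concrete, below is a minimal sketch of how an R2I-Score-like loop could be organized. The instance schema, the vqa_model.answer call, and the simple yes-rate aggregation are illustrative assumptions, not the authors' released implementation; the paper's actual question sets, judge model, and scoring rules are not reproduced here.

# Hedged sketch of a QA-style evaluation loop in the spirit of R2I-Score.
# `vqa_model.answer` and the instance schema are hypothetical stand-ins.

from dataclasses import dataclass, field

@dataclass
class EvalQuestion:
    text: str       # e.g., "Does the apple show week-old browning and wrinkling?"
    dimension: str  # one of: "alignment", "reasoning", "quality"

@dataclass
class BenchInstance:
    prompt: str    # the reasoning-oriented T2I prompt
    category: str  # e.g., "commonsense", "causal", "concept mixing"
    questions: list[EvalQuestion] = field(default_factory=list)

def r2i_style_score(image, instance: BenchInstance, vqa_model) -> dict[str, float]:
    """Answer each instance-specific question with a VQA judge and
    average the yes-rates per dimension (a simple aggregation choice;
    the paper may weight or combine the dimensions differently)."""
    per_dim: dict[str, list[float]] = {}
    for q in instance.questions:
        # Hypothetical call: assumes the judge returns a free-text answer.
        answer = vqa_model.answer(image=image, question=q.text)
        score = 1.0 if answer.strip().lower().startswith("yes") else 0.0
        per_dim.setdefault(q.dimension, []).append(score)
    return {dim: sum(v) / len(v) for dim, v in per_dim.items()}

Averaging binary answers per dimension keeps the three scores interpretable and comparable across instances; a weighted or rubric-based aggregation would be an equally plausible design.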
Figure: Key statistics of R2I-Bench.
Figure: Distribution of diverse reasoning categories in R2I-Bench (Caus.: Causal; Con. Mix.: Concept Mixing; Comp.: Compositional; Num.: Numerical; Comm.: Commonsense; Math.: Mathematical).
Figure: Example illustration of R2I-Bench and R2I-Score.
Figure: Data curation pipeline of R2I-Bench.
Figure: Failure cases of the pipeline-based framework on compositional, numerical, and mathematical reasoning.
Figure: Accuracy scores on R2I-Bench of the four top-performing models, one from each of four model architectures.
Figure: Error distribution on R2I-Bench of the four top-performing models, one from each of four model architectures.
Table: Detailed performance comparison of a standard T2I model versus a pipeline-based framework.
@misc{chen2025r2ibenchbenchmarkingreasoningdriventexttoimage,
  title={R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation},
  author={Kaijie Chen and Zihao Lin and Zhiyang Xu and Ying Shen and Yuguang Yao and Joy Rimchala and Jiaxin Zhang and Lifu Huang},
  year={2025},
  eprint={2505.23493},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23493},
}