Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering

Overview

Abstract

Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing TTI models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (ρ=0.95) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate TTI generation models.

Method

Overview

I-HallA evaluates image hallucination by using VQA to verify factual details, including those not explicitly mentioned in the text prompt. The pipeline has three stages: generating images with TTI models, using GPT-4o to produce factual reasoning about each prompt-image pair, and building question-answer sets from that reasoning to test the generated images for hallucinations.

Architecture

Dataset Construction

The I-HallA v1.0 dataset includes prompts extracted from five science and history textbooks, chosen because their content is curated and fact-verified. Each prompt is paired with factual images taken from the textbooks and with hallucinated images generated by TTI models. The benchmark uses the generated QA sets to evaluate whether an image agrees with the factual content.

Benchmark Structure

Prompt & Image Collection: Textbook captions and figures serve as sources for factual prompts and images.
Enhancement Using GPT-4o: GPT-4o adds reasoning to the prompt-image pairs to validate factuality.
QA Set Generation: Multiple-choice QA sets are formed to assess image hallucinations, reviewed and refined by human annotators.
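
The multiple-choice QA sets described above could be represented and scored roughly as follows. This is a minimal sketch, not the authors' implementation: the names QAItem, VQAModel, qa_accuracy_score, and first_choice_vqa are hypothetical, and it assumes the score for an image is simply the fraction of questions a VQA model answers correctly, which the text implies but does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice question about an image (hypothetical schema)."""
    question: str
    choices: Sequence[str]
    answer: str  # the correct choice; must be one of `choices`

    def __post_init__(self) -> None:
        if self.answer not in self.choices:
            raise ValueError(f"answer {self.answer!r} is not among the choices")


# Assumption: a VQA backend is any callable that takes an image reference,
# a question, and the answer choices, and returns the option it selects.
VQAModel = Callable[[str, str, Sequence[str]], str]


def qa_accuracy_score(image: str, qa_set: List[QAItem], vqa: VQAModel) -> float:
    """Return the fraction of questions answered correctly for `image`.

    Assumption: a higher value means the image is more factual.
    """
    if not qa_set:
        raise ValueError("qa_set must not be empty")
    correct = sum(
        1 for item in qa_set
        if vqa(image, item.question, item.choices) == item.answer
    )
    return correct / len(qa_set)


def first_choice_vqa(image: str, question: str, choices: Sequence[str]) -> str:
    """Stand-in VQA model for demonstration: always picks the first choice."""
    return choices[0]
```

In practice the stand-in first_choice_vqa would be replaced by a call to a real VQA model; keeping the backend behind a plain callable makes it easy to swap models without touching the scoring logic.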

Results

Experimental Result

Human Evaluation

I-HallA was applied to five models: DALL-E 3, Stable Diffusion v1.4, v1.5, v2.0, and SD XL-base. DALL-E 3 showed the lowest hallucination rate, but all five models exhibited problems with factual consistency. I-HallA scores correlate strongly with human assessments (Spearman ρ=0.95), which supports the metric’s reliability.
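
For readers unfamiliar with the statistic, the Spearman correlation between metric scores and human ratings can be computed as the Pearson correlation of their ranks. The sketch below is a generic pure-Python illustration under that standard definition; the function names are hypothetical, it is not the authors' evaluation code, and it contains no data from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Here x would hold one metric score per item (or per model) and y the matching human ratings. A value near 1 means the metric orders the items almost exactly as the human raters do.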

Conclusion

This paper introduces a new method for evaluating image hallucination in TTI models. By leveraging factual reasoning and VQA, I-HallA effectively measures the factuality of generated images. The strong correlation with human judgment shows promise for its future application in improving the factual accuracy of text-to-image models.

Citation

@inproceedings{ihalla,
  author    = {Youngsun Lim and
               Hojun Choi and
               Hyunjung Shim},
  title     = {Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering},
  booktitle = {AAAI},
  year      = {2025},
}