Retrieval-Augmented Generation (RAG) remains one of the most effective applications of generative AI. Yet, few chatbots manage to include images, tables, or figures from their source documents in responses. This article explains the challenges of building a truly multimodal RAG system capable of handling complex document types such as research papers and corporate reports.
Documents like reports and academic papers often include dense text, mathematical formulas, graphs, and structured tables. Integrating these varied formats coherently is difficult, and reliable document-level context is frequently lost during retrieval or summarization.
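As an illustration of what multimodal extraction involves, here is a minimal sketch using the open-source `unstructured` library. The article does not specify its parsing tooling, so the library choice, file name, and parameters are assumptions; parameter names also vary between library versions.

```python
# A sketch of splitting a PDF into text, tables, and figures, assuming the
# "unstructured" library; the input file is hypothetical.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="annual_report.pdf",   # hypothetical input document
    strategy="hi_res",              # layout-aware parsing, needed for images
    extract_images_in_pdf=True,     # save figures as separate image files
    infer_table_structure=True,     # keep table HTML instead of flat text
)

texts, tables, images = [], [], []
for el in elements:
    kind = type(el).__name__
    if kind == "Table":
        tables.append(el.metadata.text_as_html)  # structured table markup
    elif kind == "Image":
        images.append(el.metadata.image_path)    # path to the extracted figure
    else:
        texts.append(el.text)                    # plain narrative text
```

Keeping tables as HTML rather than flattened text is one way to preserve the structure that retrieval would otherwise lose.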
This article proposes an improved multimodal RAG pipeline designed to produce consistent, high-quality responses that combine text, images, and structured data. The experiments use GPT-4o as the generation model and text-embedding-3-small for embeddings.
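To show how these two models fit together, here is a minimal sketch of the common summarize-then-embed step using the OpenAI Python SDK: GPT-4o describes each figure, and text-embedding-3-small embeds the description for vector search. The prompt and function names are illustrative, not the article's exact code.

```python
# A sketch of the summarize-then-embed step with the OpenAI Python SDK.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_image(image_path: str) -> str:
    """Ask GPT-4o to describe a figure so the summary can be embedded."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this figure for retrieval."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def embed(text: str) -> list[float]:
    """Embed a summary with text-embedding-3-small for vector search."""
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return result.data[0].embedding
```

At query time, the user's question is embedded with the same model and matched against these summary vectors; the retrieved summary then points back to the original image.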
Conventional pipelines of this kind rest on the assumption that the generated caption or summary of an image always carries sufficient context to link it meaningfully with the associated text. In complex corporate or scientific documents, this assumption often fails: a chart captioned only "Quarterly revenue" says nothing about which company, segment, or year it describes.
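One possible mitigation, sketched below rather than taken from the article, is to pass the surrounding section text into the captioning prompt so the summary is self-contained. The helper reuses the `client` from the previous sketch; the 1,500-character context window is an arbitrary choice.

```python
# A mitigation sketch: enrich the figure caption with nearby document text
# so the embedded summary carries document-level context.
import base64
from openai import OpenAI

client = OpenAI()

def contextual_image_summary(image_path: str, section_title: str,
                             nearby_text: str) -> str:
    """Caption a figure with enough context to retrieve it reliably."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        f"This figure appears in the section '{section_title}'.\n"
        f"Surrounding text: {nearby_text[:1500]}\n\n"
        "Describe the figure and state explicitly which entities, time "
        "periods, and metrics it refers to, so the caption stands alone."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Embedding these enriched captions instead of bare ones gives the retriever a fair chance of matching a question like "How did revenue change in 2023?" to the right chart.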
Building a stable multimodal RAG system remains challenging because of data complexity and context misalignment across media types, but an optimized pipeline can substantially improve multimodal response quality.