Simple RAG architecture design - Indexing and Inference pipelines

Grounding the LLM to the data you care about.

As our interactions with AI and conversational systems become increasingly common, we’ve all experienced moments where a chatbot drifts off-topic, discussing irrelevant information when we’re seeking answers grounded in a specific context. The key challenge is ensuring that the chatbot stays focused on the data or document we care about and generates responses based on that context. This is precisely where Retrieval-Augmented Generation (RAG) proves valuable.

Retrieval-Augmented Generation

RAG is a hybrid architecture that combines two key components: a retriever and a generator.

The retriever is responsible for fetching relevant pieces of information from a large corpus or knowledge base, while the generator (typically a large language model) uses this retrieved content to craft a context-aware response. This architecture aims to ground the model’s output in facts and domain-specific data, thus improving relevance, factual accuracy, and controllability.
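
At its simplest, the two components compose into a single retrieve-then-generate call. The sketch below is a toy illustration of that flow; the keyword-overlap retriever and the placeholder generate function are stand-ins chosen for brevity, not a specific library or API.

```python
# Minimal retrieve-then-generate flow: the retriever selects context,
# the generator answers conditioned on that context.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(query_terms & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call made through whatever client you use."""
    return f"[LLM response conditioned on a prompt of {len(prompt)} characters]"

def rag_answer(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

corpus = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
]
print(rag_answer("How long do I have to return an item?", corpus))
```

In practice, the toy retriever above is replaced by a vector search over embedded document chunks, which is exactly what the indexing and inference pipelines below set up.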

Traditional LLMs, when used in isolation, rely solely on their pre-trained parameters to answer questions. This leads to a few key limitations: their knowledge is frozen at the training cut-off, they can hallucinate plausible but incorrect statements when asked about unfamiliar topics, and they have no access to private or domain-specific data such as internal documents.

RAG addresses these issues by introducing an external source of truth during the inference process.

A typical RAG system consists of two main stages: indexing and inference.

Indexing Phase

This phase involves preparing and storing the documents in a form suitable for retrieval at inference time. A simple indexing pipeline may look like the following (a code sketch of these steps follows the list):

  1. Document Ingestion
    Start with a collection of documents, which may be in formats such as PDFs, Word files, HTML, or plain text.
  2. Parsing and Preprocessing
    Use document parsers (e.g. PDF readers) to extract raw text content. This may also include removing boilerplate, normalising whitespace, or handling non-textual elements.
  3. Chunking
    Split the documents into smaller, manageable chunks (by paragraphs or sliding windows of N tokens). Each chunk should contain enough semantically coherent information to provide relevant context to the LLM during generation.
  4. Embedding Generation
    Use a sentence embedding model (e.g. Sentence-BERT, OpenAI’s text-embedding-3-small, or Cohere embeddings) to convert each chunk into a dense vector representation.
  5. Storage in Vector Database
    Store these embeddings in a vector database alongside metadata like document ID, source and text. This allows for efficient similarity search for retrieval at inference time.
Simple document indexing pipeline for RAG systems
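
To make these steps concrete, here is a minimal sketch of the indexing pipeline. It assumes the pypdf and sentence-transformers packages and the all-MiniLM-L6-v2 model purely as examples; any parser, embedding model, or vector store can be substituted, and the "database" here is just an in-memory NumPy matrix to keep the sketch self-contained.

```python
# Minimal indexing pipeline: parse -> chunk -> embed -> store.
# Assumed dependencies (examples only): pip install pypdf sentence-transformers numpy
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def parse_pdf(path: str) -> str:
    """Steps 1-2: extract raw text from a PDF and normalise whitespace."""
    reader = PdfReader(path)
    text = " ".join((page.extract_text() or "") for page in reader.pages)
    return " ".join(text.split())

def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Step 3: sliding window of `size` words with `overlap` words of shared context."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

# Step 4: the same embedding model must be reused at inference time.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

def build_index(paths: list[str]) -> tuple[np.ndarray, list[dict]]:
    """Step 5: store embeddings alongside metadata (source, chunk ID, text)."""
    chunks, metadata = [], []
    for path in paths:
        for i, chunk in enumerate(chunk_text(parse_pdf(path))):
            chunks.append(chunk)
            metadata.append({"source": path, "chunk_id": i, "text": chunk})
    embeddings = embedder.encode(chunks, normalize_embeddings=True)  # unit vectors -> cosine search
    return np.asarray(embeddings, dtype=np.float32), metadata
```

Calling build_index(["handbook.pdf"]) returns an embedding matrix and aligned metadata; in production these would normally live in a dedicated vector database (FAISS, pgvector, Pinecone, Weaviate, and so on) rather than a NumPy array.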

Inference Phase

When a user submits a query, the system goes through the following steps (sketched in code after the list):

  1. Query Embedding
    The input query is converted into a dense vector using the same embedding model employed during the indexing phase. This ensures alignment in vector space between the query and document chunks.
  2. Context Retrieval
    The query embedding is used to perform a similarity search, typically using cosine similarity or approximate nearest neighbours, in the vector database. This retrieves the top-K most relevant document chunks based on semantic similarity.
  3. LLM Input Construction
    The retrieved chunks are formatted and injected into the input prompt of the language model, along with the original user query.
  4. LLM Generation and Output Post-processing
    The LLM generates a response grounded in the retrieved context. This output may be post-processed to enforce constraints such as user-specific policies, safety filters, or formatting requirements. The final response is then returned to the user.
Simple RAG-based inference pipeline
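
A matching sketch of the inference side is shown below. It reuses the embedder, embeddings matrix, and metadata produced by the indexing sketch above; the commented-out OpenAI call is only an example of where the generation step sits, and any LLM client can be swapped in.

```python
# Minimal inference pipeline: embed query -> retrieve top-k -> build prompt -> generate.
import numpy as np

def retrieve(query: str, embeddings: np.ndarray, metadata: list[dict], k: int = 3) -> list[dict]:
    """Steps 1-2: embed the query with the same model, then run a cosine similarity search."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                    # cosine similarity (vectors are unit-normalised)
    top_k = np.argsort(scores)[::-1][:k]       # exact search; swap in an ANN index at scale
    return [metadata[i] for i in top_k]

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Step 3: inject the retrieved chunks and the user query into the LLM prompt."""
    context = "\n\n".join(f"[{c['source']} #{c['chunk_id']}]\n{c['text']}" for c in chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer(query: str, embeddings: np.ndarray, metadata: list[dict]) -> str:
    """Step 4: generate a grounded response (LLM call shown as a commented-out example)."""
    prompt = build_prompt(query, retrieve(query, embeddings, metadata))
    # Example only -- substitute your own LLM client:
    # from openai import OpenAI
    # response = OpenAI().chat.completions.create(
    #     model="gpt-4o-mini",
    #     messages=[{"role": "user", "content": prompt}],
    # )
    # return response.choices[0].message.content
    return prompt  # returning the prompt keeps the sketch runnable without API keys
```

Any post-processing (safety filters, formatting, user-specific policies) would then wrap the generated text before it is returned to the user.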

Conclusion

RAG provides a robust framework for building context-aware, domain-specific AI systems by combining the strengths of information retrieval with large language models. By grounding responses in retrieved content, RAG systems improve factual accuracy and enable dynamic knowledge integration.

There are numerous optimisation strategies for deploying RAG-based applications in production. I highly encourage exploring these in depth, and I plan to share some of the techniques I have used in my own projects in upcoming articles.

Ultimately, as the demand for trustworthy and adaptable AI grows, RAG stands out as a practical and scalable solution for real-world applications.