Building a Basic RAG System
RAG (Retrieval-Augmented Generation) is a technique that extends the capabilities of large language models (LLMs) by “connecting” them to a knowledge base of arbitrary size. It enriches model responses with relevant information from documents, overcoming context-window limitations and keeping answers grounded in current data.
Key Stages of RAG System Development
The development path of a basic RAG system includes four key stages:
- Parsing: Preparing data for the knowledge base — collecting documents, converting them to text format, and cleaning them of redundant information.
- Ingestion: Creating and populating the knowledge base with structured information from processed documents.
- Retrieval: Developing a mechanism that finds and returns relevant data in response to a user query. Typically involves semantic search across a vector database.
- Answering: Enriching the user’s question prompt with retrieved data, sending it to an LLM, and generating the final response.
╔═════════════════════════════════════════════════════════════════════════════╗
║ INGESTION PIPELINE ║
║ ║
║ ┌────────────────┐ parsing ┌─────────────────┐ embed & ╭──────────╮ ║
║ │ PDF/DOC/TXT/MD │ ──────────▶ │ Text Cleaning │ ────────▶ │ Vector │ ║
║ │ Files │ text │ & Chunking │ store │ DB │ ║
║ └────────────────┘ └─────────────────┘ ╰─────┬────╯ ║
╚═════════════════════════════════════════════════════════════════════════════╝
┆
route to ┆
relevant DB ┆
▼
╔═════════════════════════════════════════════════════════════════════════════╗
║ ANSWERING PIPELINE ║
║ ║
║ ┌─────────────────────┐ ║
║ │ User Question │ ║
║ └──────────┬──────────┘ ║
║ ┌────────────────┼────────────────┐ ║
║ │ │ │ ║
║ ▼ │ ▼ ║
║ ┌────────────────────────────┐ │ ┌────────────────────────────┐ ║
║ │ RETRIEVAL PROCESS │ │ │ PROMPT GENERATION │ ║
║ │ │ │ │ │ ║
║ │ ╭──────────╮ │ │ │ ┌──────────────────┐ │ ║
║ │ │ Vector │◀─ query │ │ │ │ Prompt Collection│ │ ║
║ │ │ DB │ embeddings │ │ │ └────────┬─────────┘ │ ║
║ │ ╰────┬─────╯ │ │ │ │ select │ ║
║ │ │ retrieve │ │ │ ▼ │ ║
║ │ ▼ candidates │ │ │ ┌──────────────────┐ │ ║
║ │ ┌──────────────┐ │ │ │ │ Prompt Template │ │ ║
║ │ │ LLM Re-rank │ │ │ │ └────────┬─────────┘ │ ║
║ │ └──────┬───────┘ │ │ │ │ │ ║
║ │ │ select top │ │ └────────────┼───────────────┘ ║
║ │ ▼ │ │ │ ║
║ │ ┌──────────────────┐ │ │ │ ║
║ │ │ Relevant Context │ │ │ │ ║
║ │ └────────┬─────────┘ │ │ │ ║
║ │ │ │ │ │ ║
║ └────────────┼───────────────┘ │ │ ║
║ │ │ │ ║
║ │ provide context │ pass-through │ structure query ║
║ │ │ │ ║
║ └────────────────────┼─────────────────┘ ║
║ ▼ ║
║ ┌─────────────────────┐ ║
║ │ LLM Request │ ║
║ └──────────┬──────────┘ ║
║ │ generate response ║
║ ▼ ║
║ ┌─────────────────────┐ ║
║ │ Final Answer │ ║
║ └─────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════╝
The diagram above summarizes the RAG pipeline. The top box (Ingestion) shows how raw files are parsed, cleaned, and chunked into text before being embedded into a vector database. The bottom box (Answering) shows how a user’s question triggers the retrieval of relevant context (with optional reranking) on one branch and the selection of a prompt template on the other, both feeding the final LLM request that produces the answer.
1. Parsing: Overcoming Conversion Challenges
The first step is to convert source documents (PDFs, Word docs, etc.) into a clean text format. Parsing PDF documents is a non-trivial task with multiple technical challenges:
- Preserving table structures
- Correct processing of formatting elements (headings, lists)
- Recognition of multi-column text
- Processing charts, images, formulas, headers and footers
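Even the simplest case, extracting plain text page by page, can be sketched in a few lines. The example below is a minimal sketch using the pypdf library; the library choice, the file name, and the whitespace cleanup are illustrative assumptions, and layout-heavy documents (tables, multi-column text, formulas) usually require more specialized, layout-aware parsers.

```python
# Minimal PDF-to-text extraction sketch using pypdf (one of many possible tools).
# Handles only the plain-text case; tables and multi-column layouts need heavier tooling.
from pypdf import PdfReader

def pdf_to_text(path: str) -> list[str]:
    """Return a list of page texts extracted from a PDF file."""
    reader = PdfReader(path)
    pages = []
    for page in reader.pages:
        text = page.extract_text() or ""
        # Basic cleaning: collapse whitespace left over from the PDF layout.
        pages.append(" ".join(text.split()))
    return pages

pages = pdf_to_text("annual_report.pdf")  # hypothetical input file
print(f"Extracted {len(pages)} pages")
```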
2. Ingestion: Creating a Knowledge Base
After converting PDFs to text format (Markdown) and cleaning them, it’s necessary to create databases for storing and searching information.
Chunking Strategy
The simplest approach is to use a document page as the unit of information storage (chunk). However, to improve search accuracy, it’s advisable to use smaller semantic units. Practice shows that a chunk of 8–10 sentences (200–500 words) is usually sufficient to form a complete answer, and smaller chunks increase relevance during search.
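A minimal sentence-based chunker along these lines might look as follows. The word limit, the naive sentence-splitting regex, and the one-sentence overlap are illustrative assumptions, not fixed recommendations.

```python
import re

def chunk_text(text: str, max_words: int = 400, overlap_sentences: int = 1) -> list[str]:
    """Split text into chunks of roughly max_words words, keeping whole sentences."""
    # Naive sentence split; a production system might use a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # Keep a small sentence overlap so context isn't cut mid-thought.
            current = current[-overlap_sentences:]
            count = sum(len(s.split()) for s in current)
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```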
Vectorization
For efficient searching, instead of creating one general database, a separate vector database is created for each document (for example, for each company report). This prevents mixing information about different companies and simplifies finding the necessary data in the context of a specific document.
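A per-document index can be sketched with sentence-transformers and FAISS. The embedding model name, the in-memory FAISS index, and the `parsed_reports` mapping (document name to list of chunks, assumed to come from the parsing and chunking steps above) are all illustrative assumptions.

```python
# One small FAISS index per document, as described above.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def build_index(chunks: list[str]) -> tuple[faiss.IndexFlatIP, list[str]]:
    """Embed chunks and store them in a cosine-similarity FAISS index."""
    embeddings = model.encode(chunks, normalize_embeddings=True)
    # Inner product on normalized vectors is equivalent to cosine similarity.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index, chunks

# One knowledge base per company report (parsed_reports is a hypothetical dict
# of {report_name: list_of_chunks} produced by the earlier stages):
indexes = {name: build_index(chunks) for name, chunks in parsed_reports.items()}
```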
3. Retrieval: Finding Relevant Information
The Retriever is a search system that takes a query and returns relevant text containing information for the answer. The quality of this component is critically important: if the necessary information doesn’t make it into the LLM’s context, a quality answer is impossible — “garbage in, garbage out.”
┌─────────────────────┐
│ Question │
└──────────┬──────────┘
│
│ "What is the Capital
│ of France?"
▼
┌──────────────────────────────────────┐
│ RETRIEVAL PROCESS │
│ │
│ ┌─────────────────┐ │
│ │ Embedding Model │ │
│ └────────┬────────┘ │
│ │ [Vector] │
│ ▼ │
│ ╭───────────╮ │
│ │ Vector DB │ │
│ ╰─────┬─────╯ │
│ │ Top N chunks │
│ ▼ │
│ ┌───────────────┐ │
│ │ Parent Page │ │
│ │ Retrieval │ │
│ └───────┬───────┘ │
│ │ Top N pages │
│ ▼ │
│ ┌───────────────┐ │
│ │ LLM Reranking │ │
│ └───────┬───────┘ │
│ │ Top N pages │
└────────────────┼─────────────────────┘
│
▼
┌─────────────────┐
│ Context │
│ for answering │
└─────────────────┘
In the flowchart above, the user’s question is embedded and used to query the vector store. The system retrieves the top-N similar chunks (for example, N=30). These chunks are then traced back to their parent pages or documents. A re-ranking step (such as a cross-encoder or even an LLM-based scorer) is applied to these pages, and the highest-ranked ones are selected. The final output is the top-K pages (e.g. K=10) of context, combined into a unified block for the LLM. This context, along with the question, forms the prompt for answer generation.
Methods for Improving Search Quality
Hybrid Search: Vector Search + BM25
Hybrid search combines semantic vector-based search with classic keyword-based BM25 (Best Match 25) text search. It takes into account both the meaning of the text and exact matches of words from the query. Results from both methods are combined and ranked according to a composite relevance score.
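A rough sketch of such a combination is shown below, using the rank_bm25 package for the keyword side and the FAISS index from the earlier sketch for the semantic side. The 50/50 weighting and the min-max normalization are arbitrary starting points, not recommendations.

```python
# Hybrid scoring sketch: blend normalized BM25 and vector-similarity scores.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[str], index, model,
                  alpha: float = 0.5, top_k: int = 10) -> list[int]:
    # Keyword side: classic BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))

    # Semantic side: cosine similarities from the vector index (scores for every chunk).
    query_vec = model.encode([query], normalize_embeddings=True)
    sims, ids = index.search(query_vec, len(chunks))
    vector_scores = np.zeros(len(chunks))
    vector_scores[ids[0]] = sims[0]

    # Min-max normalize both score sets so they are comparable, then blend.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    combined = alpha * norm(vector_scores) + (1 - alpha) * norm(bm25_scores)
    return sorted(range(len(chunks)), key=lambda i: combined[i], reverse=True)[:top_k]
```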
Cross-encoder Reranking
Reranking vector search results using a Cross-encoder model provides a more accurate relevance assessment. Unlike comparing texts through their vector representations, where some information is lost during vectorization, cross-encoders evaluate semantic similarity between two texts directly, giving a more accurate result.
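In code, reranking with a cross-encoder can be as small as the sketch below. The checkpoint name is a commonly used public model shown as an example; any cross-encoder trained for passage ranking would fit the same pattern.

```python
# Cross-encoder reranking sketch: score (query, passage) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

def rerank(query: str, passages: list[str], top_k: int = 10) -> list[str]:
    """Return the top_k passages ordered by cross-encoder relevance to the query."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```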
LLM Reranking
This approach uses a language model to evaluate relevance. The text and question are sent to an LLM with a request to assess the usefulness of the text for answering on a scale from 0 to 1. Previously, this approach was impractical due to the high cost of quality LLMs, but with the emergence of fast and efficient models, it has become a practical solution.
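A minimal sketch of an LLM-based relevance scorer is shown below, using the OpenAI Python SDK. The model name, the prompt wording, and the plain-number reply format are assumptions; a production version would more likely use structured outputs for the score.

```python
# LLM-as-reranker sketch: ask the model to grade relevance on a 0-1 scale.
from openai import OpenAI

client = OpenAI()

def llm_relevance(question: str, passage: str) -> float:
    prompt = (
        "Rate how useful the following text is for answering the question, "
        "on a scale from 0 to 1. Reply with the number only.\n\n"
        f"Question: {question}\n\nText: {passage}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # fall back if the model replies with something unparseable
```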
Parent Page Retrieval
After finding the most relevant chunks, this method uses them only as pointers to pages that then go into the query context. This allows including not only the directly relevant fragment but also useful surrounding information from the page.
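The page lookup itself is mostly bookkeeping over chunk metadata, as in the sketch below. The metadata schema (each chunk carrying a 0-based "page" index) and the in-memory page list are assumptions for illustration.

```python
# Parent-page lookup sketch: chunks act only as pointers to their source pages.
def chunks_to_pages(chunk_ids: list[int], chunk_meta: list[dict], pages: list[str]) -> list[str]:
    """Map retrieved chunk ids to deduplicated parent pages, preserving order."""
    seen, result = set(), []
    for cid in chunk_ids:
        page_no = chunk_meta[cid]["page"]   # assumed 0-based page index in metadata
        if page_no not in seen:             # deduplicate pages hit by several chunks
            seen.add(page_no)
            result.append(pages[page_no])
    return result
```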
Architecture of the Final Retriever
- Vectorizing the user’s question
- Searching for the top 30 relevant chunks based on the question vector
- Extracting corresponding pages using chunk metadata (with deduplication)
- Passing pages to the LLM reranker to refine relevance assessment
- Adjusting the relevance score of pages
- Returning the top 10 pages with page number information and combining them into a unified context (see the sketch after this list)
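Wired together, the steps above might look roughly like the following. This sketch reuses the hypothetical helpers from earlier sections (`model`, `chunks_to_pages`, `llm_relevance`); the separator string and the default N/K values mirror the numbers used above but are otherwise arbitrary.

```python
# End-to-end retriever sketch built from the pieces sketched earlier.
def retrieve_context(question: str, index, chunks, chunk_meta, pages,
                     n_chunks: int = 30, n_pages: int = 10) -> str:
    # 1-2. Vectorize the question and fetch the top-30 most similar chunks.
    query_vec = model.encode([question], normalize_embeddings=True)
    _, ids = index.search(query_vec, n_chunks)

    # 3. Map chunks to their parent pages, with deduplication.
    candidate_pages = chunks_to_pages(list(ids[0]), chunk_meta, pages)

    # 4-5. Re-score each candidate page with the LLM reranker.
    scored = [(page, llm_relevance(question, page)) for page in candidate_pages]
    scored.sort(key=lambda x: x[1], reverse=True)

    # 6. Return the top-10 pages merged into one context block.
    return "\n\n---\n\n".join(page for page, _ in scored[:n_pages])
```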
4. Answering: Generating the Response
In the final stage, relevant information extracted during the Retrieval phase is combined with the user’s question and sent to the language model. The LLM analyzes the provided information and generates the final answer based on relevant data from the documents.
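At its simplest, this stage is one prompt and one model call, as in the sketch below. The prompt wording and model name are assumptions rather than a prescribed template, and `client` is the OpenAI client instantiated in the LLM-reranking sketch above.

```python
# Answer-generation sketch: retrieved context plus the question go into one prompt.
def answer(question: str, context: str) -> str:
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```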
Methods for Improving Answer Quality
Routing Multi-component Queries
When working with complex queries requiring information from different sources or documents, an effective solution is to decompose them into sub-queries. The system analyzes the original question, breaks it down into components, directs each to the appropriate knowledge base, and then combines the results to form a comprehensive answer.
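One lightweight way to implement this is to have the LLM itself emit the sub-queries, as in the sketch below. The decomposition prompt, the "company | sub-question" line format, and the model name are all illustrative assumptions; a structured-output schema (see below) would be a more robust way to get the same result.

```python
# Query-routing sketch: decompose a multi-part question into per-company sub-queries.
def decompose(question: str, known_companies: list[str]) -> list[dict]:
    prompt = (
        "Split the question into independent sub-questions, one per company. "
        f"Known companies: {', '.join(known_companies)}. "
        "Return one line per sub-question in the form 'company | sub-question'.\n\n"
        f"Question: {question}"
    )
    text = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    subqueries = []
    for line in text.splitlines():
        if "|" in line:
            company, sub = line.split("|", 1)
            subqueries.append({"company": company.strip(), "question": sub.strip()})
    return subqueries  # each sub-query is then routed to that company's vector DB
```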
Chain of Thought (CoT)
CoT is a way to significantly increase the quality of answers by making the model “reason” before producing the final result. Instead of giving an answer immediately, the LLM generates a chain of intermediate steps that help arrive at the answer. This approach is particularly effective for complex, multi-stage queries requiring logical reasoning.
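In prompt terms, CoT often amounts to one extra instruction in front of the context and question. The wording below is just one possible phrasing, shown as a sketch.

```python
# Chain-of-thought prompt sketch: the model is asked to reason before answering.
def cot_prompt(question: str, context: str) -> str:
    instruction = (
        "Think step by step: first list the relevant facts from the context, "
        "then reason about them, and only then give the final answer on a new "
        "line starting with 'Answer:'."
    )
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {question}"
```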
Structured Outputs
Structured Outputs (SO) is the ability to specify a strict response format for the model, usually passed to the API as a separate field in the form of a Pydantic model or JSON schema. This forces the model to always respond with valid JSON that fully matches the specified schema, which significantly simplifies subsequent processing and integration of responses into other systems.
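With the OpenAI Python SDK this can be sketched as follows. The schema fields are illustrative, and the `beta.chat.completions.parse` helper is available in recent versions of the SDK; other providers expose similar JSON-schema parameters.

```python
# Structured-output sketch: a Pydantic schema constrains the model's reply.
from pydantic import BaseModel
from openai import OpenAI

class RAGAnswer(BaseModel):
    reasoning: str              # intermediate chain of thought
    answer: str                 # final answer text
    relevant_pages: list[int]   # page numbers used as evidence

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # example model choice
    messages=[{"role": "user", "content": "Context: ... Question: ..."}],  # prompt built as in earlier sketches
    response_format=RAGAnswer,
)
result: RAGAnswer = completion.choices[0].message.parsed
print(result.answer, result.relevant_pages)
```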
One-shot Prompts
This is another common technique that is quite simple and effective: if, in addition to instructions, you add an example answer to the prompt, the quality and consistency of responses will significantly improve. The model receives a clear example of what format and style of answer is expected, which helps it better meet the requirements of the task.
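A one-shot prompt might be assembled as in the sketch below. The example question, the company name, and the figures are entirely hypothetical and serve only to show the expected answer style.

```python
# One-shot prompt sketch: a single worked example of the expected answer format.
ONE_SHOT_EXAMPLE = (
    "Example question: What was Acme Corp's revenue in 2022?\n"          # hypothetical
    "Example answer: According to page 12 of the annual report, "
    "revenue in 2022 was $4.2B.\n"                                       # hypothetical figure
)

def one_shot_prompt(question: str, context: str) -> str:
    return (
        "Answer the question using the context. Follow the style of the example.\n\n"
        f"{ONE_SHOT_EXAMPLE}\nContext:\n{context}\n\nQuestion: {question}"
    )
```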
RAG use cases
RAG has many applications. First, there are chatbots for customer service, consulting, and technical support: instead of trying to bake all documentation into the model itself, the system pulls the required information from a repository that is always up to date. In e-commerce, for example, when a user asks about a product, the system can fetch its description from the database and return it immediately. Second, RAG is used to build business-analytics assistants that let managers track sales, performance, and other metrics. Another important area is scientific and medical systems, where access to evidence, articles, and clinical guidelines must be highly reliable. Because the model is grounded in factual information extracted from verified sources, it is less likely to “make things up,” which makes the approach valuable wherever the cost of an error is high.
RAG is essentially a prompt-engineering tool, so it can also be used in a purely technical way to modify the model’s behaviour during message processing: for example, by dynamically injecting clarifying prompts, or by searching the chat history to pull in the most relevant past user queries and model responses.
Conclusion
Creating an effective RAG system is an iterative process of improving each component: from quality parsing and well-thought-out chunking strategy to sophisticated search methods and response generation. Each stage contributes to the final quality of the system, and optimizing the entire chain can significantly increase the accuracy and usefulness of the answers.