Evaluating the Optimal Document Chunk Size for a RAG Application
In Retrieval-Augmented Generation (RAG) applications, one crucial factor that significantly impacts performance is the chunk size of documents. This blog post delves into the intricacies of chunk size optimization, demonstrating how it affects a RAG pipeline and how to determine the ideal chunk size for your specific use case.
What Is Chunk Size and How Does It Affect a RAG Pipeline?
Chunk size refers to the length of text segments into which documents are divided before being processed and stored in a vector database. In a RAG pipeline, these chunks are crucial as they form the basic units of information that can be retrieved in response to a query.
The choice of chunk size can have several impacts on your RAG system:
- Relevance: Smaller chunks can lead to more precise retrieval of relevant information, as they allow for finer-grained matching with queries.
- Context: Larger chunks preserve more context, which can be beneficial for understanding complex topics or maintaining coherence in responses.
- Performance: Smaller chunks generally result in faster processing and retrieval, but splitting documents more finely produces more vectors to store, which increases storage requirements.
- Quality of Embeddings: The size of chunks can affect the quality of vector embeddings, potentially impacting the accuracy of similarity searches.
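To make these trade-offs concrete, here is a minimal sketch of how chunk size and overlap change the way a document is split, using LlamaIndex's SentenceSplitter (the sample text and sizes are illustrative, not taken from the project):
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
# Illustrative document; in the project, documents are loaded from disk instead
doc = Document(text=(
    "Technetium is widely used in medical imaging. "
    "It also plays a role in nuclear reactor research. "
    "Future work faces supply and safety challenges."
))
# A smaller chunk_size produces more, finer-grained chunks; chunk_overlap keeps
# neighbouring chunks from losing context at the boundaries
for chunk_size in (25, 1024):
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=int(0.05 * chunk_size))
    nodes = splitter.get_nodes_from_documents([doc])
    print(f"chunk_size={chunk_size}: {len(nodes)} chunks")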
How to Run the App
1. Set Up Your Environment
To get started, you need to ensure that Python and Streamlit are installed on your system. Follow these steps:
- Clone the Repository: First, clone the repository containing the project files using the following Git command:
git clone https://github.com/AI-ANK/Evaluating-the-Ideal-Document-Chunk-Size-for-a-RAG-Application.git
- Install Dependencies: Install all required dependencies by running:
pip install -r requirements.txt
- Environment Variables: Create a .env file in the root directory of the project to store sensitive data such as API keys:
OPENAI_API_KEY=your_openai_api_key
QDRANT_URL=your_qdrant_url
QDRANT_API_KEY=your_qdrant_api_key
2. Configure Qdrant
Before you can launch the application, you need to set up a Qdrant cluster:
- Create a Qdrant Cluster: Follow the steps outlined in the Qdrant documentation to create a cluster in Qdrant Cloud. You can find the guide here: Qdrant Cloud Quickstart.
- Configuration: Make sure to note down the URL and API key for your Qdrant cluster. These will be used in your .env file to enable communication between your application and the Qdrant database.
- In-Memory Version: For simplicity, you can use the in-memory version of Qdrant by passing :memory: when initializing the client, which avoids the need to set up a full Qdrant cluster. This project uses the in-memory version, as shown in the snippet below.
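Both options come down to how the Qdrant client is initialized; a minimal sketch (the cloud values come from the .env file described above):
import os
from qdrant_client import QdrantClient
# Option 1: connect to a Qdrant Cloud cluster using the credentials from .env
cloud_client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))
# Option 2: ephemeral in-memory instance, no cluster required (used in this project)
local_client = QdrantClient(location=":memory:")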
3. Launch the Application
Finally, you can start the application by executing the following command in your terminal:
streamlit run app.py
This will start the Streamlit server and open your default web browser to the URL where the app is hosted, typically http://localhost:8501. You can interact with the app through this interface to upload documents and explore how different chunk sizes perform.
Code Deep Dive
Let’s start by importing the required libraries and setting up a basic RAG pipeline using Qdrant as our vector store. We’ll use the Ragas framework to evaluate our pipeline’s performance.
Technology Stack Overview
The project is built upon a well-considered stack of technologies, each chosen for its strength in delivering efficient and scalable solutions for AI-powered applications.
- LlamaIndex: LLM orchestration framework.
- Qdrant: Vector search engine for efficient and scalable document retrieval.
- OpenAI’s GPT-3.5-Turbo: For generating and evaluating responses.
- Streamlit: For building an interactive UI to upload documents and display results.
- Ragas (RAG Assessment): A framework for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a structured approach to assessing how well a system retrieves relevant information and generates accurate, coherent responses. RAG evaluation focuses on the relevance and informativeness of the retrieved data, the coherence and fluency of the generated outputs, and the overall system efficiency. This evaluation is crucial for applications like customer support, content generation, and educational tools, where high-quality, contextually relevant responses are essential.
Metrics in Use:
- Context Recall: How well the retrieved information covers the relevant context.
- Context Precision: The accuracy of the retrieved chunks in matching the query.
- Context Relevancy: Balances precision and recall to measure relevancy.
- Context Entity Recall: Evaluates the retrieval of specific entities mentioned in the query.
For a detailed explanation of the metrics used in this evaluation, refer to the “Analyzing the Results” section.
import streamlit as st
import os
import pandas as pd
import nest_asyncio
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    context_entity_recall,
)
from ragas.integrations.llama_index import evaluate
from datasets import Dataset
# Load environment variables and initialize Qdrant client
load_dotenv()
qdrant_client = QdrantClient(
    url=os.getenv('QDRANT_URL'),
    api_key=os.getenv('QDRANT_API_KEY')
)
# Initialize vector store and other components
vector_store = QdrantVectorStore(client=qdrant_client, collection_name="your_collection_name")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
documents = SimpleDirectoryReader("./document").load_data()
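The evaluation in the next section scores a fixed set of questions against ground-truth answers. The exact question set used for this post is not reproduced here, so the sketch below uses a hypothetical question/answer pair and assumes the common pattern of passing the questions and ground truths to the Ragas integration as parallel lists in a dict (dataset_dict), which the evaluation loop relies on:
# Hypothetical evaluation set; in practice, use questions and ground-truth
# answers drawn from your own documents
df_questions_answers = pd.DataFrame({
    "Question": ["What are the main applications of technetium?"],
    "Answer": ["Technetium is primarily used as a radiotracer in medical imaging."],
})
# Assumed input format for the Ragas LlamaIndex integration: parallel lists
# of questions and ground-truth answers
dataset_dict = {
    "question": df_questions_answers["Question"].tolist(),
    "ground_truth": df_questions_answers["Answer"].tolist(),
}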
Evaluating the RAG Pipeline with Different Chunk Sizes
Now, let’s evaluate how different chunk sizes affect our RAG pipeline’s performance:
chunk_sizes = [25, 1024, 2000]
results = []
for chunk_size in chunk_sizes:
    Settings.chunk_size = chunk_size
    Settings.chunk_overlap = int(0.05 * chunk_size)
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine(similarity_top_k=2)
    # dataset_dict holds the evaluation questions and ground-truth answers (see above)
    result = evaluate(
        query_engine=query_engine,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall, context_relevancy, context_entity_recall],
        dataset=dataset_dict,
        llm=OpenAI(model="gpt-3.5-turbo-0125"),
        embeddings=OpenAIEmbedding(),
    )
    results.append(result.to_pandas())
    results[-1]['chunk_size'] = chunk_size
all_results_df = pd.concat(results, ignore_index=True)
This code evaluates the pipeline with different chunk sizes and stores the results for comparison.
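To compare chunk sizes at a glance, the per-question scores can be averaged per chunk size and shown in the Streamlit UI; a minimal sketch, assuming the metric columns in the results DataFrame follow Ragas' metric names:
# Average each Ragas metric per chunk size and display the comparison
metric_cols = ["faithfulness", "answer_relevancy", "context_precision",
               "context_recall", "context_relevancy", "context_entity_recall"]
summary_df = all_results_df.groupby("chunk_size")[metric_cols].mean().reset_index()
st.subheader("Average metric scores per chunk size")
st.dataframe(summary_df)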
Analyzing the Results
After running the analysis with different chunk sizes, let’s examine how each chunk size affects various metrics such as:
- Context Recall: Reflects how well the retrieved chunks cover the relevant context. Larger chunk sizes generally provide better context recall as they include more information.
Example: For a query about the “applications and challenges for future research related to technetium,” a larger chunk size of 1024 tokens might retrieve a comprehensive paragraph that includes multiple applications and challenges, covering a broad context. In contrast, a smaller chunk size might only retrieve a single application or challenge, missing out on the complete context.
- Context Precision: Indicates how accurately the retrieved chunks match the query. Smaller chunk sizes typically offer higher precision due to more fine-grained matching.
Example: For the same query, a smaller chunk size of 25 tokens might retrieve a sentence specifically mentioning “technetium applications in medical imaging,” providing a precise answer. Larger chunks, while offering more context, might include extraneous information, reducing precision.
- Context Relevancy: Measures the relevancy of the retrieved chunks to the query. This metric balances between precision and recall.
Example: For the query, a chunk size of 200 tokens might strike a balance by retrieving a few sentences that are highly relevant to “technetium applications” while also providing some context about “future research challenges,” thus achieving high relevancy.
- Context Entity Recall: Evaluates the retrieval of specific entities mentioned in the query. It can vary depending on how entities are distributed across the chunks.
Example: If the query mentions multiple entities like “technetium, medical imaging, and nuclear reactors,” a larger chunk size of 1024 tokens might capture all these entities in a single retrieval. Smaller chunks might only capture one or two entities, potentially missing others.
Advanced Retrieval Techniques: Context Enrichment with Sentence Window Retrieval
To balance precision and context retention, we introduce an advanced technique called Context Enrichment with Sentence Window Retrieval. This method optimizes both the retrieval and generation stages by tailoring text chunk size to the specific needs of each process.
The Sentence Window Retrieval technique involves decoupling the retrieval and generation processes to enhance overall system performance. During retrieval, smaller data chunks — specifically individual sentences — are used to achieve precise matching with the query. This approach leverages the advantages of fine-grained data retrieval, which can improve the relevance of the retrieved information.
Grounded in the principle of optimizing both retrieval and generation, this technique initially focuses on retrieving single sentences to ensure precise and relevant information extraction. In the post-processing phase, additional sentences surrounding the retrieved sentence are included to provide a broader context. This expanded context is crucial during the generation phase, where the LLM benefits from a richer dataset, enabling more detailed and accurate responses.
Implementation:
- Node Sentence Window Retrieval: In this approach, we first split the document into sentences and create nodes for each sentence.
- Metadata Replacement: During post-processing, we replace the original retrieved sentence with a window of surrounding sentences to enrich the context.
Here’s the implementation of the advanced technique:
# Additional imports used for sentence window retrieval
from llama_index.core.node_parser import SentenceSplitter, SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Load documents
documents = SimpleDirectoryReader("./document").load_data()
# Initialize Qdrant client
client = QdrantClient(location=":memory:")
# Create two separate vector stores for sentence window and base indices
sentence_vector_store = QdrantVectorStore(client=client, collection_name="sentence_window_collection")
base_vector_store = QdrantVectorStore(client=client, collection_name="base_collection")
# Create storage contexts
sentence_storage_context = StorageContext.from_defaults(vector_store=sentence_vector_store)
base_storage_context = StorageContext.from_defaults(vector_store=base_vector_store)
# Initialize text splitter and LLM
text_splitter = SentenceSplitter()
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# Update Settings
Settings.llm = llm
Settings.text_splitter = text_splitter
Settings.chunk_size = 50
Settings.chunk_overlap = 10
# Create node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
# Get nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes = text_splitter.get_nodes_from_documents(documents)
# Create indices using Qdrant vector stores
sentence_index = VectorStoreIndex(nodes, storage_context=sentence_storage_context)
base_index = VectorStoreIndex(base_nodes, storage_context=base_storage_context)
# Create query engine with sentence window retrieval
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")]
)
# Pick the first evaluation question (df_questions_answers holds the question set introduced earlier)
user_question = df_questions_answers.iloc[0]["Question"]
window_response = query_engine.query(user_question)
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]
# Displaying window and vector responses clearly
st.subheader("Query Results")
st.write(f"**User Question:** {user_question}")
st.write("**Response using Sentence Window Retrieval:**")
st.write(window_response.response)
st.divider()
st.markdown(f"**Window:** \n\n {window}")
st.markdown(f"**Original Sentence:** \n\n {sentence}")
query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query(user_question)
st.divider()
st.write("**Response using Normal Retrieval:**")
st.write(vector_response.response)
st.write("**Source Nodes used for Normal Retrieval:**")
for source in vector_response.source_nodes:
    st.write(source)
Comparing Results
The results are displayed in the Streamlit UI, allowing for an easy comparison between normal retrieval and sentence window retrieval:
- Response Using Sentence Window Retrieval: This method retrieves smaller, precise chunks, and enriches them with a window of surrounding sentences for better context.
Example: For the query on “technetium applications and challenges,” sentence window retrieval might pull a focused sentence on a specific application, along with the preceding and following sentences that provide context about its challenges.
- Response Using Normal Retrieval: This method retrieves chunks based on the specified chunk sizes without additional context enrichment.
Example: For the same query, normal retrieval with a chunk size of 1024 tokens might retrieve a broad paragraph covering multiple points but with some less relevant details included.
By comparing these results, we can observe the benefits of sentence window retrieval in balancing precision and context retention.
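To put numbers behind this comparison, the same Ragas evaluate call used in the chunk-size experiment can be run against both query engines; a minimal sketch, reusing dataset_dict and the metrics imported earlier:
# Score both retrieval strategies on the same evaluation set
window_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
base_engine = base_index.as_query_engine(similarity_top_k=2)
for name, engine in [("sentence window", window_engine), ("normal", base_engine)]:
    scores = evaluate(
        query_engine=engine,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        dataset=dataset_dict,
        llm=OpenAI(model="gpt-3.5-turbo-0125"),
        embeddings=OpenAIEmbedding(),
    )
    st.write(f"**Ragas scores ({name} retrieval):**")
    st.dataframe(scores.to_pandas())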
Conclusion
Determining the ideal chunk size for a RAG application is not a one-size-fits-all solution. It depends on your specific use case, the nature of your documents, and the types of queries you expect to handle. Here are some key takeaways:
- Experiment with Different Chunk Sizes: As we’ve seen, different chunk sizes can significantly impact various performance metrics. It’s crucial to test multiple sizes and analyze the results.
- Consider Your Use Case: If precision is paramount, smaller chunks might be preferable. For tasks requiring more context, larger chunks or context enrichment techniques may be more suitable.
- Balance Performance and Quality: While smaller chunks can improve retrieval speed, they may require more storage and potentially miss important context. Strike a balance that works for your application.
- Implement Advanced Techniques: Consider using methods like Sentence Window Retrieval to combine the benefits of small and large chunks.
- Continuous Evaluation: As your dataset grows or your use case evolves, periodically re-evaluate your chunk size to ensure optimal performance.
By carefully considering chunk size and implementing appropriate strategies, you can significantly enhance the performance and effectiveness of your RAG application with Qdrant.
Happy chunking!