In this blog, I explore the ‘Lost Context Problem’ that often plagues RAG systems and present two innovative techniques: Late Chunking and Contextual Retrieval. These methods can significantly enhance the accuracy of your retrieval systems and minimize frustrating hallucinations.
The Lost Context Problem
The lost context problem is a significant challenge in retrieval-augmented generation (RAG) systems. When an agent retrieves information, it often fails to connect the dots between different pieces of data. This can lead to inaccurate answers or, worse, hallucinations—where the agent generates responses based on unrelated information.
For example, if I take a document about Berlin, standard RAG systems chunk the text into smaller segments, like sentences. However, these segments lose their contextual relationships. If one chunk mentions “Berlin” and another uses “its” without the prior context, the system may struggle to link them. This disconnection can result in incomplete or misleading responses.
To address this, I focus on two innovative techniques: Late Chunking and Contextual Retrieval. These methods aim to retain context throughout the retrieval process, enhancing accuracy and reducing hallucinations.
Understanding Chunking
Chunking involves splitting a document into manageable segments for processing. This strategy helps in analyzing large texts without overwhelming the system. There are various chunking methods, such as creating segments based on sentences or fixed lengths. However, a one-size-fits-all approach doesn’t exist; the optimal strategy depends on the specific document type.
As I explored the chunking process, I found that different strategies yield different results. For instance, overlapping segments can help maintain context, but they also increase complexity. It’s essential to choose a method that aligns with the document’s nature to ensure effective retrieval.
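To make this concrete, here is a minimal sketch of two basic strategies in plain JavaScript, sentence-based splitting and fixed-length splitting with overlap. The split rules and sizes are illustrative only, not a recommendation.

```javascript
// Two simple chunking strategies (sizes and split rules are illustrative only).

// Strategy 1: split on sentence boundaries.
function chunkBySentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Strategy 2: fixed-length windows with overlap, trading redundancy for context.
function chunkFixedLength(text, size = 500, overlap = 100) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```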
The Late Chunking Strategy
Late chunking is a relatively new approach that redefines how we handle chunking in RAG systems. Instead of chunking first and embedding later, late chunking flips this process on its head. This technique allows for embedding the entire document in one go before chunking it into smaller pieces.
This method relies on long-context embedding models, which can process many thousands of tokens in a single pass. By embedding first, I can retain the relationships between different parts of the text, improving the overall coherence of the retrieved information.
How Late Chunking Works
To implement late chunking, I start by loading the entire document into an embedding model. This generates a vector embedding for every token in a single pass, so each token’s embedding reflects the context of the whole text. After embedding, I can then apply my preferred chunking strategy: sentences, paragraphs, or fixed lengths.
Once I’ve segmented the document, I identify which token vectors correspond to each chunk. Even if a specific term isn’t mentioned in a chunk, context from the rest of the document is baked into those vectors. By applying mean pooling, averaging the token vectors that fall within a chunk, I create a single representative embedding for each chunk.
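As a rough sketch of that pooling step, assume the embedding model returns one vector per token along with each token’s character span in the document; the helper below is hypothetical and model-agnostic.

```javascript
// Late chunking sketch: the document has already been embedded once at the
// token level; here we mean-pool the token vectors that fall inside each chunk.
// tokenVectors: number[][]   one vector per token (from a long-context model)
// tokenSpans:   [start, end] character span of each token in the document
// chunkSpans:   [start, end] character span of each chunk
function lateChunkEmbeddings(tokenVectors, tokenSpans, chunkSpans) {
  return chunkSpans.map(([chunkStart, chunkEnd]) => {
    // Keep the vectors of tokens whose span overlaps this chunk.
    const vectors = tokenVectors.filter((_, i) => {
      const [tokStart, tokEnd] = tokenSpans[i];
      return tokStart < chunkEnd && tokEnd > chunkStart;
    });
    if (vectors.length === 0) return null;

    // Mean pooling: average the selected token vectors dimension by dimension.
    const pooled = new Array(vectors[0].length).fill(0);
    for (const v of vectors) {
      for (let d = 0; d < v.length; d++) pooled[d] += v[d] / vectors.length;
    }
    return pooled;
  });
}
```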
Benefits of Late Chunking
- Improved Context Retention: By embedding the entire document first, late chunking helps maintain the relationships between chunks, leading to more accurate responses.
- Enhanced Retrieval Accuracy: The method significantly reduces the chances of hallucinations, as the agent can rely on a richer context when generating answers.
- Flexibility in Chunking Strategy: Late chunking allows for the use of various strategies post-embedding, making it adaptable to different document types.
- Efficiency: By processing larger portions of text at once, late chunking can streamline the retrieval process, especially for long documents.
Implementing Late Chunking in N8N
Implementing late chunking in N8N requires a few manual steps due to limitations in the platform’s support for custom embedding models. I start by fetching a file from a Google Drive folder. This file is typically the same one I use in many of my RAG videos: the Formula One technical regulations, which run to a lengthy 180 pages.
After downloading the file, I check its type. If it’s a PDF, I use the “Extract from PDF” node; for Google Docs, I convert the content to markdown. The next step is to check the length of the extracted text: if it exceeds 30,000 characters, I create a summary to maintain context while staying within the embedding model’s limitations.
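Inside N8N, that length check can live in a small Code node. The snippet below is only a sketch, and the field names are assumptions that depend on what the extraction node actually outputs.

```javascript
// N8N Code node sketch: flag documents that need summarising before embedding.
// The `text` field name is an assumption; match it to your extraction node.
const text = $json.text || '';

return [
  {
    json: {
      text,
      length: text.length,
      needsSummary: text.length > 30000, // route long documents to the summary branch
    },
  },
];
```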
Testing the Late Chunking Implementation
After segmenting the document into manageable parts, I loop through these larger sections to create more granular segments, aiming for a chunk size of 1,000 characters with a 200-character overlap. The overlap helps ensure that chunks don’t cut off words and stay readable.
I also use separators such as newlines when splitting the text. The chunking functionality in N8N is somewhat limited, as it’s embedded within the vector store nodes, so I rely on a custom JavaScript function for this step, sketched below.
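Here is a rough version of that custom function. It splits on newlines first and then packs the pieces into roughly 1,000-character chunks with a 200-character overlap; treat it as a sketch of the idea rather than the exact node contents.

```javascript
// Split on newlines, then pack pieces into ~1,000-character chunks with overlap.
function splitWithOverlap(text, chunkSize = 1000, overlap = 200) {
  const pieces = text.split('\n').filter((p) => p.trim().length > 0);
  const chunks = [];
  let current = '';

  for (const piece of pieces) {
    if ((current + '\n' + piece).length > chunkSize && current.length > 0) {
      chunks.push(current);
      // Carry the tail of the previous chunk forward as overlap.
      current = current.slice(-overlap) + '\n' + piece;
    } else {
      current = current ? current + '\n' + piece : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```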
Introduction to Contextual Retrieval
Contextual retrieval is another innovative approach that I’m excited to explore. This method was introduced by Anthropic and leverages the long context window of large language models (LLMs) rather than that of the embedding model. It aims to improve the way chunks are contextualized within a document.
In this approach, documents are first split into chunks, but rather than sending these chunks directly into an embedding model, I send them to an LLM along with the original document. The LLM analyzes the chunk in context, providing a descriptive blurb that explains how it fits into the overall document.
How Contextual Retrieval Works
With contextual retrieval, once I have the document split into chunks, I send each chunk to the LLM along with the original document. The LLM generates a short description that contextualizes the chunk. This additional information is then combined with the chunk and sent to the embedding model to create vectors.
For instance, if I take the Berlin document again, a chunk may receive a description like, “This section discusses Berlin’s population.” This context helps ensure that when the chunk is retrieved, it retains its relevance and connection to the overall document.
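The prompt that produces the blurb can be very simple. The sketch below loosely follows the example Anthropic published alongside contextual retrieval; the exact wording is an assumption and worth tuning for your documents.

```javascript
// Build the prompt that asks an LLM to situate one chunk within the full document.
function buildContextPrompt(fullDocument, chunk) {
  return `<document>
${fullDocument}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
${chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.`;
}
```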
Challenges with Contextual Retrieval
While contextual retrieval shows great promise, it comes with its own set of challenges. The primary concern is the time it takes to ingest documents, especially larger ones. Each chunk must be processed through the LLM, which can lead to significant delays.
Additionally, there are cost implications. If a document contains a million tokens, the entire document must be sent to the LLM for every single chunk. This can quickly add up, making it less feasible for larger datasets.
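A quick back-of-envelope calculation shows why, using rough, illustrative numbers rather than measured ones.

```javascript
// Rough cost estimate for naive contextual retrieval without prompt caching.
const documentTokens = 1_000_000;   // a ~1M-token document
const charsPerToken = 4;            // common rule of thumb
const chunkSizeChars = 1000;

// Roughly 4,000 chunks for a document of this size.
const numChunks = Math.ceil((documentTokens * charsPerToken) / chunkSizeChars);

// Each chunk re-sends the full document, so input tokens scale multiplicatively.
const totalInputTokens = numChunks * documentTokens; // on the order of 4 billion
console.log({ numChunks, totalInputTokens });
```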
Implementing Contextual Retrieval in N8N
To implement contextual retrieval in N8N, I begin by fetching the document, which often comes from a Google Drive folder. This document can be quite extensive, often exceeding 30,000 tokens. Once I have the document, I extract its contents based on its type: if it’s a PDF, I use the “Extract from PDF” node; for Google Docs, I convert it to markdown format.
Next, I need to estimate the token length of the extracted text. This step is crucial because it determines whether I can cache the document for contextual retrieval. If the document is larger than 35,000 tokens, I encode it in Base64 and send it to the context caching endpoint, which stores the entire document in memory for later reference.
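The sketch below assumes a Gemini-style context caching API; the endpoint, model name, and payload shape are assumptions based on Google’s cachedContents documentation and should be checked against whichever provider you actually use.

```javascript
// Sketch: estimate token length, then cache a large document for later calls.
// Endpoint, model name, and payload shape are assumptions (Gemini-style API).
const estimateTokens = (text) => Math.ceil(text.length / 4); // ~4 chars per token

async function cacheDocument(text, apiKey) {
  const body = {
    model: 'models/gemini-1.5-flash-001',
    contents: [
      {
        role: 'user',
        parts: [
          {
            // Base64-encode the raw text so it can be sent as inline data.
            inline_data: {
              mime_type: 'text/plain',
              data: Buffer.from(text).toString('base64'),
            },
          },
        ],
      },
    ],
    ttl: '3600s', // keep the cached document available for one hour
  };

  const res = await fetch(
    `https://generativelanguage.googleapis.com/v1beta/cachedContents?key=${apiKey}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  );
  return res.json(); // the response includes a cache name to reference later
}
```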
Creating Contextual Blurbs
Once the document is cached, I split it into chunks. Each chunk is then sent to a large language model (LLM) along with the original document. The LLM analyzes each chunk in context and generates a brief description. This descriptive blurb is essential as it ties the chunk back to the overall document, ensuring that the context is retained.
For example, if a chunk discusses Berlin’s population, the LLM might generate a blurb stating, “This chunk pertains to Berlin, specifically discussing its demographics.” This additional information is invaluable for the embedding process, as it provides context that would otherwise be lost.
Sending Chunks for Embedding
After generating the contextual blurbs, I combine them with the original chunks and send this enriched data to the embedding model. This step is crucial for creating the vectors that will be stored in the vector database. I set the chunk size appropriately, ensuring that it adheres to the limitations of the embedding model being used.
In my case, I’m using OpenAI’s text embedding model, which requires careful handling of chunk sizes. I avoid additional chunking at this stage since the data has already been split. The goal here is to ensure that each chunk, along with its context, is accurately represented in the vector store.
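A minimal version of that embedding call might look like the following; the model name is an assumption, so swap in whichever OpenAI embedding model you actually use.

```javascript
// Embed a contextualised chunk with OpenAI's embeddings API.
// The blurb is prepended to the chunk so the stored vector carries both.
async function embedContextualChunk(blurb, chunk, apiKey) {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: 'text-embedding-3-small', // assumed model; use your own choice
      input: `${blurb}\n\n${chunk}`,
    }),
  });
  const data = await res.json();
  return data.data[0].embedding; // vector to store alongside the chunk text
}
```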
Testing Contextual Retrieval
After all chunks have been sent for embedding, I test the contextual retrieval system. I pose various queries to the vector store to evaluate the accuracy and relevance of the responses. The results often reveal the effectiveness of the contextual blurbs in enhancing the quality of the answers.
For instance, if I ask about specific components of an F1 car, I expect a detailed response that reflects the information contained in the document. The contextual blurbs should help ensure that the model retrieves the most pertinent chunks, leading to more accurate answers. In practice, this method tends to yield richer and more informative responses compared to traditional retrieval methods.
Comparing the Two Techniques
When comparing contextual retrieval and late chunking, I find distinct advantages to each approach. Late chunking excels in maintaining context across larger segments of text, while contextual retrieval provides a more nuanced understanding by leveraging LLMs to generate contextual blurbs.
- Contextual Retrieval: Offers improved context through LLM-generated blurbs, which can significantly enhance the quality of responses.
- Late Chunking: Provides a streamlined process for embedding while retaining relationships between chunks, minimizing the risk of losing context.
In practice, the choice between these techniques often depends on the specific use case. For instance, if the primary concern is the accuracy of context in responses, contextual retrieval may be the better option. However, if speed and efficiency are paramount, late chunking might prove more beneficial.
Evaluating Performance
Performance evaluation is vital when implementing either technique. I typically set up a framework for assessing the quality of responses generated by the system. This involves running a series of queries and analyzing the relevance and completeness of the answers.
For example, I might compare the results of a query using standard RAG methods against those obtained through contextual retrieval. Often, the latter yields more comprehensive and detailed answers. This evaluation process helps in refining the techniques and determining which method works best for specific types of documents.
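A simple harness for this kind of comparison can be as small as the sketch below; the two retrieval functions are placeholders for the standard and contextual workflows.

```javascript
// Run the same questions through two retrieval pipelines and collect the
// answers side by side for review.
async function compareRetrievalMethods(questions, askStandardRag, askContextualRag) {
  const results = [];
  for (const question of questions) {
    results.push({
      question,
      standard: await askStandardRag(question),
      contextual: await askContextualRag(question),
    });
  }
  return results; // inspect manually or score with an LLM-as-judge step
}
```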