Chunking Techniques – Splitters

Goal

Choose a chunking strategy (splitter) for indexing course content.

Data

Module 'Kritisches Denken' (Critical Thinking). The test data is one specific long chunk that a previous evaluation had identified as problematic because it was not retrieved for various questions.

Method/Approach

Splitters tested:

  • Sentence Splitter

  • Semantic Splitter

  • Human best-try Splitter

  • Semantic Double Merging Splitter

  • Simple Dot Splitter

Each splitter was tested with chunk sizes of 100, 150, and 200. SBERT vector embeddings were used for similarity search.

Results

With the sentence splitter, the best retrieval quality reached 65% (CG at k=5) with a chunk size of 100. The semantic splitter reached 70% with one particular parameter combination, but stayed around 50-60% in most cases. Semantic double merging achieved 65% at best, and the simple dot splitter 62%.

Evaluation Metrics

Retrieval quality: Cumulative Gain (CG at k=1 to k=6)
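
A minimal sketch of how this metric can be computed, assuming binary relevance labels (1 if a retrieved chunk contains the answer, 0 otherwise) listed in rank order. The percentages on this page are read here as the share of questions with at least one relevant chunk in the top k. That aggregation and the function names are assumptions, not the evaluation code that was actually used.

```python
def cumulative_gain_at_k(labels, k):
    """Sum of the relevance labels of the top-k results for one question."""
    return sum(labels[:k])


def hit_rate_at_k(labels_per_question, k):
    """Share of questions with at least one relevant chunk in the top k.

    Assumption: this is how the per-question gains are aggregated into
    a single percentage.
    """
    hits = sum(
        1 for labels in labels_per_question
        if cumulative_gain_at_k(labels, k) > 0
    )
    return hits / len(labels_per_question)


# Hypothetical labels for four questions, six retrieved chunks each.
example_runs = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]

for k in range(1, 7):
    print(f"k={k}: {hit_rate_at_k(example_runs, k):.0%}")
```

Because the labels are binary, the cumulative gain of a single question is simply the number of relevant chunks among its top k results.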

Conclusions

The sentence splitter was selected for its balance between retrieval quality and simplicity. The semantic splitter reached the highest retrieval quality in one configuration, but a suitable combination of its buffer and breakpoint parameters has to be found first, which adds complexity to the system without a large gain in performance.


Optimizing chunking techniques

To ensure that relevant information is retrieved and passed to the LLM for generating contextually appropriate responses, large documents are broken down into smaller pieces of text (chunks).

Sentence Splitter

A straightforward technique that avoids cutting sentences prematurely. The splitter tries to keep sentences and paragraphs together. Compared to the simplest character-based splitter, chunks are therefore less likely to end with dangling sentences or sentence fragments.

Semantic Splitter

Organizes text based on semantic similarity. Instead of chunking text with a fixed chunk size, the semantic splitter uses embedding similarity to ensure that a text chunk contains sentences that are semantically related to each other. It is useful for identifying coherent and related chunks of information within a larger body of text. However, it requires an embedding model and a similarity measure, and its output depends on choosing appropriate parameters.
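
The idea can be sketched in a few lines. This is an illustration under stated assumptions, not the splitter used in the experiments: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the parameter names `buffer_size` and `breakpoint_percentile` are hypothetical labels for the buffer and breakpoint settings.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Percentile of a non-empty list, with linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_split(sentences, buffer_size=1, breakpoint_percentile=95,
                   embed=toy_embed):
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # Embed each sentence together with `buffer_size` neighbours per side.
    windows = [
        embed(" ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1]))
        for i in range(len(sentences))
    ]
    # Distance between each sentence window and the next one.
    distances = [
        cosine_distance(windows[i], windows[i + 1])
        for i in range(len(windows) - 1)
    ]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # unusually large jump: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


example_sentences = [
    "Cats chase mice.",
    "Cats chase birds.",
    "Stocks rise today.",
    "Stocks fall today.",
]
example_chunks = semantic_split(
    example_sentences, buffer_size=0, breakpoint_percentile=80
)
```

The sketch makes the tuning burden visible: both the buffer and the percentile threshold change where the breakpoints fall, and every sentence window needs its own embedding.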

Semantic Double Merging

It extends traditional semantic chunking by adding a second pass that merges chunks to create more content-rich units. Initially, text is divided based on semantic coherence using measures like percentiles or standard deviations. In the second pass, the algorithm evaluates the similarity between the current chunk and the chunk two positions ahead. If their cosine similarity is high, all three chunks (the current one and the next two) are merged, even if the chunk in between is not similar to its neighbors. Since this approach helps retain broader context, it is best suited for scenarios where understanding the overall meaning matters most.
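
A minimal sketch of the second pass only, assuming the first pass has already produced a list of chunks. The word-overlap similarity, the threshold value, and the function names are illustrative assumptions. A real implementation would compare embeddings from a proper model.

```python
import math
import re
from collections import Counter


def toy_similarity(a, b):
    """Cosine similarity of bag-of-words vectors (stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def second_pass_merge(chunks, similarity=toy_similarity, threshold=0.5):
    """Merge chunk i with the next two when chunk i and chunk i+2 are similar."""
    merged = []
    i = 0
    while i < len(chunks):
        if i + 2 < len(chunks) and similarity(chunks[i], chunks[i + 2]) >= threshold:
            # The middle chunk is pulled in even if it resembles neither neighbor.
            merged.append(" ".join(chunks[i:i + 3]))
            i += 3
        else:
            merged.append(chunks[i])
            i += 1
    return merged


first_pass_chunks = [
    "Cats chase mice.",
    "Note: see figure 2.",
    "Cats chase birds.",
    "Stocks rise today.",
]
merged_chunks = second_pass_merge(first_pass_chunks)
```

In this example the short aside in the middle is absorbed into the surrounding topic instead of being left as an isolated chunk.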

Our Findings

When we tested these three techniques on our question-answering data, we observed only a slight improvement from the sentence splitter to the semantic splitters. Because semantic splitters would add complexity and resource usage (additional embedding calculations) to our system without sufficient benefit, we opted for the sentence splitter. Moreover, optimizing the chunk size with the sentence splitter sometimes yields results as good as those of the semantic splitter, at least for our current use case.


Experimental Setup

In short: The experiments showed that chunk size has a clear impact on the retrieval quality of our AI Tutor's RAG system. We tested chunk sizes of 100 and 150 tokens and found that each had its own advantages depending on the course content. Ultimately, we standardized on a chunk size of 150, which led to a noticeable improvement in retrieval quality across our test cases.

Goal: To determine the chunk size that yields the best retrieval performance, optimizing for the highest cumulative gain metric (CG at k).

Chunking Technique:

We used a Sentence Splitter for chunking (chosen on the basis of the previous experiments). It segments documents at sentence boundaries to maintain context and is controlled by the specified chunk size (in tokens) and chunk overlap. An overlap of 5 was applied, meaning each chunk shares 5 tokens with the previous chunk to preserve context continuity. The chunks were indexed in our OpenSearch index in two variants: chunk sizes of 100 and 150.
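
The interplay of chunk size and overlap can be sketched as follows. This is an illustration, not the splitter used in the experiments: tokens are approximated by whitespace-separated words, sentence boundaries are found with a simple regular expression, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text, chunk_size=150, overlap=5):
    """Pack whole sentences into chunks of roughly `chunk_size` tokens.

    Each new chunk starts with the last `overlap` tokens of the previous
    one. A sentence is never cut, so a single very long sentence can
    produce a chunk that exceeds `chunk_size`.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        words = sentence.split()
        if current and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk over as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


example_text = "One two three. Four five six. Seven eight nine."
example_chunks = sentence_chunks(example_text, chunk_size=6, overlap=2)
```

With these toy settings the second chunk begins with the last two tokens of the first, which is the continuity effect the overlap parameter is meant to provide.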

Vector Search and Retrieval:

Questions were encoded using the SBERT model (thenlper/gte-large). The index was built in OpenSearch using the following HNSW parameters: M=24, ef_search=100, ef_construction=128. Vector searches were performed with the size parameter set to retrieve a maximum of either 6 or 10 results per query. Retrieved chunks were labeled 1 if they contained the answer to the question and 0 otherwise.
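
For orientation, the parameters above could map onto OpenSearch k-NN request bodies roughly as sketched below. Only M, ef_search, ef_construction, and the result size come from this page. The field names `text` and `embedding`, the `nmslib` engine, and the `cosinesimil` space type are assumptions, and where ef_search is configured depends on the engine and OpenSearch version. The dimension of 1024 matches the output size of thenlper/gte-large.

```python
EMBEDDING_DIM = 1024  # output size of thenlper/gte-large


def build_index_body(field="embedding", dim=EMBEDDING_DIM):
    """Index settings and mapping for an HNSW k-NN vector field.

    Assumptions: field names, engine, and space type are illustrative.
    """
    return {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": 100,
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                field: {
                    "type": "knn_vector",
                    "dimension": dim,
                    "method": {
                        "name": "hnsw",
                        "engine": "nmslib",
                        "space_type": "cosinesimil",
                        "parameters": {"m": 24, "ef_construction": 128},
                    },
                },
            }
        },
    }


def build_knn_query(vector, k=6, field="embedding"):
    """Search body that returns at most `k` nearest chunks for a query vector."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


index_body = build_index_body()
query_body = build_knn_query([0.0] * EMBEDDING_DIM, k=6)
```

These dictionaries would be passed as the request bodies when creating the index and when searching it. The question vector would come from encoding the question with the same model that embedded the chunks.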

Data

Two prototype courses were used:

  • Kritisches Denken

    • ServiceID: 2701 (Chunk size = 100), Total chunks: 102

    • ServiceID: 2699 (Chunk size = 150), Total chunks: 88

  • Agiles Mindset

    • ServiceID: 2702 (Chunk size = 100), Total chunks: 201

    • ServiceID: 2700 (Chunk size = 150), Total chunks: 158

Each course was evaluated using 20 content-related questions to measure retrieval quality.

Results and Conclusion

  • For Kritisches Denken (KD), the optimal chunk size was 100, with the improvement visible within the top 4 results (CG at k=4).

  • For Agiles Mindset (AM), a chunk size of 150 provided better results overall. Retrieval found the relevant chunk within the top 3 positions in 70% of cases.

Despite course-specific variations, a chunk size of 150 was selected for consistency. This new chunking strategy, combined with optimized retrieval parameters, notably improves retrieval quality compared to the previously indexed chunks of different sizes (ServiceID 2677 and 2675), where the best CG achieved was 50% for AM and 40% for KD. We should, however, continue monitoring retrieval performance as more courses are added to ensure that the chosen chunk size and retrieval parameters remain optimal.

Retrieval quality for different chunk sizes.

OpenSearch scores of retrieved chunks by rank position.

Red data points indicate relevant chunks. Chunk size is 150.

Test Questions


Some extra details

First experiments show that the xAPI chunk strategy delivers better results than the simple PDF-page chunk strategy. We could consider parsing the chunks during indexing in order to remove useless chunks, if any exist.

https://eduplex.atlassian.net/browse/EDX-518

A second experiment with two different courses also shows that xAPI performs better than PDF. We noticed that the chunks sometimes lack the context needed to answer a question, so some questions remain unanswered.
