So, what exactly is Chunking?
Imagine providing a PDF book to someone to summarize its key concepts. However, you discover that this person recalls mainly the concepts from the book's final chapters and not much from the earlier ones. This situation is common for all of us. Without revisiting concepts, it's challenging to remember them, especially if we consume a significant amount of information afterward. This memory timeframe is known as a context window or context length. Similarly, LLMs also have a context length, beyond which they struggle to retain information.
To address this challenge, we use text-splitting. This technique involves dividing the entire text corpus into smaller segments. These segments are sized to fit within the context window of the LLM we intend to use. In short, the length of the segments is reduced to be smaller than the context length of the LLM.
But how should we chunk a large text?
“Nature is like a big
playground! where a
nimals, birds, and i
nsects live and play
together. Imagine a
beautiful forest wi
th tall trees where
birds sing songs all
day. There are anim
als like - squirrels
who jump from tree
to tree, and rabbits who hop around in t he grass.”
Here, every line is a chunk. It's obvious that each chunk fails to capture the complete context after abruptly ending the chunk every time, which could lead to confusion for the LLM during the retrieval process. Consequently, our LLM might generate irrelevant results.Next is the RecursiveCharacterTextSplitter which divides a large text into smaller chunks by utilizing a set of characters as the separator (with the default set including ["\n\n", "\n", " ", ""]). Initially, it attempts to split the text using the "\n\n" character. If the resulting splits are still larger than the specified chunk size, it proceeds to the next character, "\n", and repeats the splitting process. This process continues until a split smaller than the specified chunk size is obtained.
Suppose you have text stored in a more complex format like a PDF or Markdown file format and you need to extract and segment it. In such cases, the DocumentTextSplitter is more suitable as it accommodates diverse data formats. It operates similarly to the Recursive Character TextSplitter but uses distinct separators tailored for each file type. For instance, for Markdown files, it has specific characters as separators like:
Semantic chunking involves utilizing embeddings to assign meaning to textual data. By comparing the embeddings of two texts and calculating the distance or dissimilarity between them, we can determine their relatedness. This approach allows us to utilize semantic similarity for grouping or chunking text.
Agent-based chunking involves generating chunks that may not directly correspond to segments of the original text, but are re-created to convey meaningful statements independently. Unlike semantic chunking, which relies on embedding models, agent-based chunking utilizes a Large Language Model (LLM) to generate each chunk. These chunks can also be grouped to create bigger chunks with better context.