Chunking
Embedding an entire document as a single vector often leads to poor retrieval performance. The model is forced to compress all of the document's information into one representation, making it difficult to capture granular details; important context gets lost and retrieval quality suffers.
There are several strategies for chunking a document. A simple approach is to slice the text every n characters. However, this can split sentences or phrases awkwardly, producing incomplete or distorted chunks. Moreover, because tokens vary in length (a token may be a single character, a word, or a punctuation mark), character-based splitting gives little control over the token count of each chunk.
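To make the failure mode concrete, here is a toy illustration of fixed-size character slicing (the function name and sample text are invented for this example):
import textwrap  # not needed for the split itself; shown only if you want to pretty-print chunks

def naive_split(text: str, n: int = 20) -> list[str]:
    # Slice every n characters, ignoring word and sentence boundaries.
    return [text[i:i + n] for i in range(0, len(text), n)]

print(naive_split("Pathway processes live data streams."))
# ['Pathway processes li', 've data streams.'] -- the word "live" is cut in half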
A better method is to chunk the text by token count while respecting sentence or paragraph boundaries, so that each chunk remains coherent. In practice, token-based chunking splits at logical breakpoints such as periods, commas, or newlines.
TokenCountSplitter
Pathway offers a TokenCountSplitter for token-based chunking. Here's how to use it:
from pathway.xpacks.llm.splitters import TokenCountSplitter

# Produce chunks of 100-500 tokens, counted with the cl100k_base encoding.
text_splitter = TokenCountSplitter(
    min_tokens=100,
    max_tokens=500,
    encoding_name="cl100k_base",
)
This configuration creates chunks of 100–500 tokens using the cl100k_base tokenizer, which is compatible with OpenAI's embedding models.
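Here is a minimal sketch of wiring the splitter into a Pathway pipeline. It assumes TokenCountSplitter can be applied as a UDF to a text column; the sample table and the text column name are invented for illustration:
import pathway as pw
from pathway.xpacks.llm.splitters import TokenCountSplitter

# The splitter configured as above.
text_splitter = TokenCountSplitter(
    min_tokens=100,
    max_tokens=500,
    encoding_name="cl100k_base",
)

# A toy one-column table standing in for real document data
# (underscores keep the cell a single markdown token).
documents = pw.debug.table_from_markdown(
    """
    text
    Pathway_is_a_data_processing_framework_for_live_data...
    """
)

# Each row's text is split into a list of token-bounded chunks.
chunks = documents.select(chunks=text_splitter(pw.this.text))
pw.debug.compute_and_print(chunks)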
For more on token encodings, refer to OpenAI's tiktoken guide.
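As a quick sanity check, you can inspect how a string tokenizes under cl100k_base with the tiktoken library (a small sketch; the sample string is arbitrary):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Chunking keeps retrieval granular.")
print(len(tokens))         # how many tokens this string costs under cl100k_base
print(enc.decode(tokens))  # decoding round-trips to the original string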