Chunking

Embedding an entire document as a single vector often leads to poor retrieval performance. The model is forced to compress all of the document's information into one representation, making it difficult to capture granular details. As a result, important context may be lost and retrieval effectiveness drops.

There are several strategies for chunking a document. A simple approach is to slice the text every n characters. However, this can split sentences or phrases awkwardly, producing incomplete or distorted chunks. Moreover, because tokens vary in length (a token may be a single character, a whole word, or a punctuation mark), character-based splitting makes it hard to keep the token count of each chunk consistent.
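
For illustration, here is a minimal sketch of character-based slicing (the sample text and chunk size are arbitrary), showing how it cuts straight through words:

text = "Pathway processes streaming data with incremental computation."
chunk_size = 20
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
print(chunks)
# ['Pathway processes st', 'reaming data with in', 'cremental computatio', 'n.']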

A better method is to chunk the text by token count while keeping each chunk coherent and aligned with sentence or paragraph boundaries. In practice, token-based chunking breaks the text at logical breakpoints such as periods, commas, or newlines.
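
The sketch below illustrates the idea (it is not Pathway's implementation): sentences are split at periods and then greedily packed into chunks whose token count, measured with tiktoken's cl100k_base encoding, stays under a limit. The function name and the default limit are chosen for this example only.

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 500) -> list[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    # Naive sentence split at periods; a real splitter handles more breakpoints.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(enc.encode(candidate)) > max_tokens:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks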

TokenCountSplitter

Pathway offers a TokenCountSplitter for token-based chunking. Here's how to use it:

from pathway.xpacks.llm.splitters import TokenCountSplitter

text_splitter = TokenCountSplitter(
    min_tokens=100,               # minimum number of tokens per chunk
    max_tokens=500,               # maximum number of tokens per chunk
    encoding_name="cl100k_base",  # tiktoken encoding used to count tokens
)

This configuration creates chunks of 100–500 tokens using the cl100k_base tokenizer, compatible with OpenAI's embedding models.
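
As a usage sketch, assuming the splitter can be applied as a Pathway UDF inside a select (the table, column, and sample text below are illustrative, not part of Pathway's documented example):

import pathway as pw

# Illustrative input table with a single "text" column.
documents = pw.debug.table_from_rows(
    schema=pw.schema_from_types(text=str),
    rows=[("A long document that should be split into smaller, embeddable chunks...",)],
)

# Apply the splitter to the text column; each row yields a list of chunk entries.
chunks = documents.select(chunks=text_splitter(pw.this.text))
pw.debug.compute_and_print(chunks)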

For more on token encodings, refer to OpenAI's tiktoken guide.
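
To inspect how cl100k_base tokenizes a piece of text, tiktoken can be used directly:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Chunking keeps retrieval granular.")
print(len(tokens))         # number of tokens the embedding model will see
print(enc.decode(tokens))  # decodes back to the original string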