pw.xpacks.llm.splitters

A library of text splitters: routines that split a long text into smaller chunks.

class pw.xpacks.llm.splitters.TokenCountSplitter(min_tokens=50, max_tokens=500, encoding_name='cl100k_base')

Splits a given string or a list of strings into chunks based on token count.

This splitter tokenizes the input texts and splits them into smaller parts (“chunks”) ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points such as punctuation marks.

All arguments have default values, which may be overridden in the UDF call.

  • Parameters
    • min_tokens (int) – minimum number of tokens in a chunk. Defaults to 50.
    • max_tokens (int) – maximum number of tokens in a chunk. Defaults to 500.
    • encoding_name (str) – name of the tiktoken encoding used to tokenize the text. Defaults to 'cl100k_base'.
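
To make the roles of min_tokens and max_tokens concrete, here is a minimal pure-Python sketch of the general idea. It is not Pathway's implementation: the function name split_by_token_count is hypothetical, whitespace-separated words stand in for tiktoken tokens, and the punctuation rule is an assumption chosen for illustration.

```python
def split_by_token_count(text, min_tokens=50, max_tokens=500):
    """Greedy chunking sketch (hypothetical helper, not the Pathway API).

    Assumption: whitespace-separated words stand in for real tokens.
    """
    if not 1 <= min_tokens <= max_tokens:
        raise ValueError("require 1 <= min_tokens <= max_tokens")

    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        # Hard upper bound for the current chunk.
        end = min(start + max_tokens, len(tokens))
        cut = end
        if end < len(tokens):
            # Walk back from the upper bound looking for a token that ends
            # in punctuation, without shrinking the chunk below min_tokens.
            for i in range(end, start + min_tokens - 1, -1):
                if tokens[i - 1][-1] in ".!?;,":
                    cut = i
                    break
        chunks.append(" ".join(tokens[start:cut]))
        start = cut
    # Note: the final chunk may hold fewer than min_tokens tokens,
    # because it simply takes whatever text is left.
    return chunks


sample = "One two three. Four five six seven. Eight nine ten."
print(split_by_token_count(sample, min_tokens=2, max_tokens=4))
```

In this sketch max_tokens is a hard cap, while min_tokens only limits how far back the search for a punctuation break may go. Whether the real splitter treats the last chunk or the punctuation set the same way is not stated here, so treat those details as illustrative.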

Example:

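import pathway as pw
from pathway.xpacks.llm.splitters import TokenCountSplitter

# Build a one-row debug table with a single `text` column.
t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
# With min_tokens=1 and max_tokens=1, each chunk is limited to a single token.
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
# Add a `chunks` column holding the splitter output for each row.
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)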