pw.xpacks.llm.splitters
A library of text splitters - routines which split a long text into smaller chunks.
class pw.xpacks.llm.splitters.TokenCountSplitter(min_tokens=50, max_tokens=500, encoding_name='cl100k_base')
Splits a given string or a list of strings into chunks based on token count.
This splitter tokenizes the input texts and splits them into smaller parts (“chunks”) ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points such as punctuation marks.
All constructor arguments set defaults, which may be overridden in the UDF call.
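The chunking behaviour described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `split_by_token_count` and the `PUNCTUATION` tuple are hypothetical, and whitespace-separated words stand in for tiktoken tokens so that the snippet runs without extra dependencies.

```python
# Characters treated as sensible break points (an assumption of this sketch).
PUNCTUATION = (".", "?", "!")


def split_by_token_count(text, min_tokens=50, max_tokens=500):
    """Greedy chunking sketch; words stand in for tiktoken tokens."""
    if max_tokens < 1 or min_tokens > max_tokens:
        raise ValueError("need max_tokens >= 1 and min_tokens <= max_tokens")
    tokens = text.split()
    shortest = max(min_tokens, 1)  # never emit an empty chunk
    chunks = []
    i = 0
    while i < len(tokens):
        # Take at most max_tokens tokens as the candidate chunk.
        window = tokens[i : i + max_tokens]
        cut = len(window)
        # Prefer to end at the last punctuation mark, as long as the
        # chunk keeps at least min_tokens tokens.
        for j in range(len(window), shortest - 1, -1):
            if window[j - 1].endswith(PUNCTUATION):
                cut = j
                break
        chunks.append(" ".join(window[:cut]))
        i += cut
    return chunks


sample = "One two three. Four five six seven eight. Nine ten."
print(split_by_token_count(sample, min_tokens=2, max_tokens=5))
# ['One two three.', 'Four five six seven eight.', 'Nine ten.']
```

In this sketch the final chunk may be shorter than `min_tokens` when too little text remains, and a window without punctuation is simply cut at `max_tokens`.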
- Parameters
  - min_tokens (int) – minimum tokens in a chunk of text.
  - max_tokens (int) – maximum size of a chunk in tokens.
  - encoding_name (str) – name of the encoding from tiktoken.
Example:
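import pathway as pw
from pathway.xpacks.llm.splitters import TokenCountSplitter

# One-row table with a single text column.
t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
# Limit chunks to a single token so that even a short string is split.
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
# Add a chunks column holding the splitter output for each row.
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)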
__call__(text, **kwargs)
Split given strings into smaller chunks.
- Parameters
  - text (ColumnExpression[str]) – Column with texts to be split.
  - **kwargs – override for defaults set in the constructor.
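The override behaviour of `**kwargs` can be illustrated with a small stand-alone sketch. The class `SketchSplitter` and its `resolve` method are hypothetical names used only here; they show one plausible way constructor defaults and call-time overrides could be merged, and are not taken from the Pathway source.

```python
class SketchSplitter:
    """Hypothetical stand-in for a splitter with overridable defaults."""

    def __init__(self, min_tokens=50, max_tokens=500, encoding_name="cl100k_base"):
        # Defaults fixed at construction time.
        self.defaults = {
            "min_tokens": min_tokens,
            "max_tokens": max_tokens,
            "encoding_name": encoding_name,
        }

    def resolve(self, **kwargs):
        # Call-time keyword arguments take precedence over the defaults.
        return {**self.defaults, **kwargs}


splitter = SketchSplitter(min_tokens=1, max_tokens=10)
print(splitter.resolve(max_tokens=3))
# {'min_tokens': 1, 'max_tokens': 3, 'encoding_name': 'cl100k_base'}
```

With the real class, the equivalent override would presumably be passed in the call itself, for example `splitter(pw.this.text, max_tokens=3)`, leaving the other constructor defaults in place.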