pw.xpacks.llm.splitters
A library of text splitters - routines which split a long text into smaller chunks.
class BaseSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)
[source] Abstract base class for splitters that split a long text into smaller chunks.
__call__(text, **kwargs)
[source] Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
- Parameters
  - text (ColumnExpression) – input column containing text to be split
  - **kwargs – overrides for default settings from the constructor
- Returns
  pw.ColumnExpression – A column of pairs: (chunk text, metadata). Metadata is propagated to all chunks created from the same input string. If no metadata is provided, an empty dictionary is used.
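The contract above can be sketched in plain Python. This is an illustration only, not Pathway's `BaseSplitter` (which operates on Pathway columns); `sentence_chunk` is a hypothetical name, and splitting on sentence boundaries is just one arbitrary strategy a subclass might use:

```python
# Illustrative sketch of the splitter contract:
# (text, metadata) -> list of (chunk, metadata) pairs,
# with the same metadata propagated to every chunk.
# NOT the Pathway implementation; 'sentence_chunk' is hypothetical.

def sentence_chunk(text, metadata=None):
    """Split on '. ' and attach the input's metadata to every chunk."""
    metadata = metadata if metadata is not None else {}
    parts = [p.strip() for p in text.split(". ") if p.strip()]
    return [(part, metadata) for part in parts]

pairs = sentence_chunk("First sentence. Second sentence", {"doc": "a.txt"})
# Every chunk carries the same metadata dict; with no metadata, {} is used.
```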
class NullSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)
[source] A splitter which returns its argument as one long text with null metadata.
- Parameters
  txt – text to be split
- Returns
  list of pairs – chunk text and metadata.
The null splitter always returns a list of length one containing the full text and empty metadata.
__call__(text, **kwargs)
[source] Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
- Parameters
  - text (ColumnExpression) – input column containing text to be split
  - **kwargs – overrides for default settings from the constructor
- Returns
  pw.ColumnExpression – A column of pairs: (chunk text, metadata). Metadata is propagated to all chunks created from the same input string. If no metadata is provided, an empty dictionary is used.
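The per-string behaviour of NullSplitter, as documented above, can be sketched in plain Python (an illustration of the contract, not Pathway's implementation; `null_chunk` is a hypothetical name):

```python
# Sketch of what NullSplitter does per input string: no splitting at all.
# NOT the Pathway implementation; 'null_chunk' is hypothetical.

def null_chunk(text, metadata=None):
    # Always a list of length one: the full text, with empty
    # metadata when none is provided.
    return [(text, metadata if metadata is not None else {})]

null_chunk("a very long document")
# → [("a very long document", {})]
```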
class TokenCountSplitter(min_tokens=50, max_tokens=500, encoding_name='cl100k_base')
[source] Splits a given string or a list of strings into chunks based on token count.
This splitter tokenizes the input texts and splits them into smaller parts (“chunks”), ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points such as punctuation marks. The splitter expects its input to be a Pathway column of strings, or of pairs of a string and a metadata dict.
All default arguments may be overridden in the UDF call.
- Parameters
  - min_tokens (int) – minimum number of tokens in a chunk of text.
  - max_tokens (int) – maximum size of a chunk in tokens.
  - encoding_name (str) – name of the encoding from tiktoken. For a list of available encodings please refer to the tiktoken documentation: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
Example:
from pathway.xpacks.llm.splitters import TokenCountSplitter
import pathway as pw
t = pw.debug.table_from_markdown(
'''| text
1| cooltext'''
)
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
t += t.select(chunks = splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)
__call__(text, **kwargs)
[source] Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
- Parameters
  - text (ColumnExpression) – input column containing text to be split
  - **kwargs – overrides for default settings from the constructor
- Returns
  pw.ColumnExpression – A column of pairs: (chunk text, metadata). Metadata is propagated to all chunks created from the same input string. If no metadata is provided, an empty dictionary is used.
chunk(text, metadata={}, **kwargs)
[source] Split a given string into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.
- Parameters
  - text – input text to be split
  - metadata (dict) – metadata associated with the input text
  - **kwargs – overrides for defaults set in the constructor
- Returns
  list[tuple[str, dict]] – List of pairs: (chunk text, metadata). Metadata is propagated to all chunks created from the same input string. If no metadata is provided, an empty dictionary is used.
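To illustrate the min/max token logic described above, here is a pure-Python sketch. Whitespace words stand in for tiktoken tokens, and the sketch omits the real splitter's preference for breaking at punctuation; `count_chunks` is a hypothetical name, not part of the Pathway API:

```python
# Illustrative sketch of token-count chunking. Whitespace "tokens" stand in
# for tiktoken tokens; the actual TokenCountSplitter uses the configured
# tiktoken encoding and prefers to break at punctuation.
# 'count_chunks' is hypothetical, not part of Pathway.

def count_chunks(text, min_tokens=1, max_tokens=500):
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        # Greedily take up to max_tokens tokens per chunk.
        end = min(start + max_tokens, len(tokens))
        chunks.append(tokens[start:end])
        start = end
    # Fold a trailing chunk shorter than min_tokens into its predecessor,
    # so every emitted chunk respects the lower bound where possible.
    if len(chunks) > 1 and len(chunks[-1]) < min_tokens:
        chunks[-2].extend(chunks.pop())
    return [" ".join(c) for c in chunks]

count_chunks("one two three four five", max_tokens=2)
# → ["one two", "three four", "five"]
```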