pw.xpacks.llm.splitters

A library of text splitters - routines which split a long text into smaller chunks.

class BaseSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

[source]

Abstract base class for splitters that split a long text into smaller chunks.

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for the default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
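
A custom splitter subclasses BaseSplitter. As a rough sketch (ParagraphSplitter is hypothetical, and it assumes subclasses supply their per-document splitting logic in a chunk method, mirroring the chunk method documented for TokenCountSplitter below):

from pathway.xpacks.llm.splitters import BaseSplitter

class ParagraphSplitter(BaseSplitter):
    # Hypothetical splitter: one chunk per blank-line-separated paragraph.
    def chunk(self, text, metadata={}, **kwargs):
        parts = [p.strip() for p in text.split("\n\n") if p.strip()]
        # Propagate a copy of the same metadata to every chunk.
        return [(p, dict(metadata)) for p in parts]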
    

class NullSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

[source]

A splitter which returns its argument as one long text with null metadata.

  • Parameters
    txt – text to be split
  • Returns
    list of pairs – chunk text and metadata.

The null splitter always returns a list of length one containing the full text and empty metadata.
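
A minimal usage sketch, mirroring the TokenCountSplitter example further below (the table contents are illustrative):

import pathway as pw
from pathway.xpacks.llm.splitters import NullSplitter

t = pw.debug.table_from_markdown(
    '''| text
1| one long document'''
)
splitter = NullSplitter()
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)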

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for the default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
    

class TokenCountSplitter(min_tokens=50, max_tokens=500, encoding_name='cl100k_base')

[source]

Splits a given string or a list of strings into chunks based on token count.

This splitter tokenizes the input texts and splits them into smaller parts (“chunks”), ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points, such as punctuation marks. The splitter expects its input to be a Pathway column of strings, or of pairs of a string and a dict of metadata.

All default arguments may be overridden in the UDF call.

Example:

from pathway.xpacks.llm.splitters import TokenCountSplitter
import pathway as pw
t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for the default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
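
Because **kwargs override the constructor defaults, settings can also be changed per call. A sketch (the table t and the override value are illustrative):

splitter = TokenCountSplitter(min_tokens=50, max_tokens=500)
# Use a smaller max_tokens for this call only; the other settings keep
# the values given to the constructor.
chunked = t.select(chunks=splitter(pw.this.text, max_tokens=100))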
    

chunk(text, metadata={}, **kwargs)

[source]

Split a given string into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (str) – input text to be split
    • metadata (dict) – metadata associated with the input text
    • **kwargs – overrides for the defaults set in the constructor
  • Returns
    list[tuple[str, dict]]
    List of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
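
For instance, chunk can be called directly on a plain Python string outside a Pathway pipeline (the text and metadata are illustrative):

splitter = TokenCountSplitter(min_tokens=5, max_tokens=50)
pieces = splitter.chunk(
    "A longer piece of text to be split into chunks.",
    metadata={"source": "doc.txt"},
)
# pieces is a list of (chunk text, metadata) pairs; each pair carries
# the metadata passed in above.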