pw.xpacks.llm.splitters

A library of text splitters - routines which split a long text into smaller chunks.

class BaseSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

[source]

Abstract base class for splitters that split a long text into smaller chunks.

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
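
Since BaseSplitter is abstract, a concrete splitter is written by subclassing it. Below is a minimal sketch, assuming (as the chunk method documented further down suggests) that subclasses supply the per-string chunking logic in chunk; the ParagraphSplitter name and the splitting rule are purely illustrative:

from pathway.xpacks.llm.splitters import BaseSplitter

class ParagraphSplitter(BaseSplitter):
    # Hypothetical splitter: one chunk per paragraph, metadata passed through.
    def chunk(self, text, metadata={}, **kwargs):
        # Split on blank lines and drop empty paragraphs.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        # Propagate the same metadata dict to every chunk, as the contract above requires.
        return [(p, metadata) for p in paragraphs]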
    

class NullSplitter(*, return_type=..., deterministic=False, propagate_none=False, executor=AutoExecutor(), cache_strategy=None)

[source]

A splitter which returns its argument as one long text with null metadata.

  • Parameters
    txt – text to be split
  • Returns
    list of pairs – chunk text and metadata.

The null splitter always returns a list of length one containing the full text and empty metadata.
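
For instance, a minimal pipeline (the table content is illustrative):

import pathway as pw
from pathway.xpacks.llm.splitters import NullSplitter

t = pw.debug.table_from_markdown(
    '''| text
1| some long document'''
)
splitter = NullSplitter()
# Each row yields a single (full text, empty metadata) pair.
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)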

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
    

class RecursiveSplitter(chunk_size=500, chunk_overlap=0, separators=SEPARATORS, is_separator_regex=False, encoding_name=None, model_name=None, hf_tokenizer=None)

[source]

Splitter that splits a long text into smaller chunks based on a set of separators. Chunking is performed recursively: the text is split on the first separator in the list, then on the second, and so on, until the resulting chunks are shorter than chunk_size. Chunk length is measured in characters if none of encoding_name, model_name or hf_tokenizer is provided; otherwise it is measured in the number of tokens the given tokenizer would output.

Under the hood it is a wrapper around langchain_text_splitters.RecursiveCharacterTextSplitter (MIT license).

  • Parameters
    • chunk_size (int) – maximum size of a chunk in characters/tokens.
    • chunk_overlap (int) – number of characters/tokens to overlap between chunks.
    • separators (list[str]) – list of strings to split the text on.
    • is_separator_regex (bool) – whether the separators are regular expressions.
    • encoding_name (str | None) – name of the encoding from tiktoken. For the list of available encodings please refer to tiktoken documentation: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
    • model_name (str | None) – name of the model from tiktoken. See the link above for more details.
    • hf_tokenizer (PreTrainedTokenizerBase | None) – Huggingface tokenizer to use for tokenization.
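
For example, a character-based configuration might look as follows (a sketch; the separator list and sizes are illustrative):

from pathway.xpacks.llm.splitters import RecursiveSplitter

# No encoding_name/model_name/hf_tokenizer given, so chunk_size and
# chunk_overlap are measured in characters.
splitter = RecursiveSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", " "],
)

# Token-based variant: lengths are measured with a tiktoken encoding instead.
token_splitter = RecursiveSplitter(chunk_size=400, encoding_name="cl100k_base")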

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
    

class TokenCountSplitter(min_tokens=50, max_tokens=500, encoding_name='cl100k_base')

[source]

Splits a given string or a list of strings into chunks based on token count.

This splitter tokenizes the input texts and splits them into smaller parts (“chunks”), ensuring that each chunk has a token count between min_tokens and max_tokens. It also attempts to break chunks at sensible points, such as punctuation marks. The splitter expects its input to be a Pathway column of strings, or of pairs consisting of a string and a dict of metadata.

All default arguments may be overridden in the UDF call.

Example:

from pathway.xpacks.llm.splitters import TokenCountSplitter
import pathway as pw
t = pw.debug.table_from_markdown(
    '''| text
1| cooltext'''
)
splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
t += t.select(chunks=splitter(pw.this.text))
pw.debug.compute_and_print(t, include_id=False)
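
As noted above, constructor defaults can be overridden per call via **kwargs; for example (the big_chunks column name is illustrative):

# Override max_tokens for this call only; other defaults stay as configured.
t += t.select(big_chunks=splitter(pw.this.text, max_tokens=100))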

__call__(text, **kwargs)

[source]

Split a given text into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (ColumnExpression) – input column containing text to be split
    • **kwargs – overrides for default settings from the constructor
  • Returns
    pw.ColumnExpression
    A column of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
    

chunk(text, metadata={}, **kwargs)

[source]

Split a given string into smaller chunks. Preserves metadata and propagates it to all chunks that are created from the same input string.

  • Parameters
    • text (str) – input text to be split
    • metadata (dict) – metadata associated with the input text
    • **kwargs – overrides for defaults set in the constructor
  • Returns
    list[tuple[str, dict]]
    List of pairs: (chunk text, metadata).
      Metadata are propagated to all chunks created from the same input string.
      If no metadata is provided, an empty dictionary is used.
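
A minimal sketch of using chunk directly on a plain string, outside a Pathway pipeline (the text and metadata values are illustrative):

from pathway.xpacks.llm.splitters import TokenCountSplitter

splitter = TokenCountSplitter(min_tokens=50, max_tokens=500)
# Returns a list of (chunk text, metadata) pairs; the metadata dict is
# propagated unchanged to every chunk.
pairs = splitter.chunk("A long document ...", metadata={"path": "doc.txt"})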