pw.xpacks.llm.embedders

Pathway embedder UDFs.

class GeminiEmbedder(*, capacity=None, retry_strategy=None, cache_strategy=None, model='models/embedding-001', api_key=None, **gemini_kwargs)

Pathway wrapper for Google Gemini Embedding services.

The capacity, retry_strategy and cache_strategy need to be specified during object construction. All other arguments can be overridden during application. The Gemini API truncates the input if the text is longer than the model’s context length.

Parameters
- capacity (int | None) – Maximum number of concurrent operations allowed. Defaults to None, indicating no specific limit.
- retry_strategy (AsyncRetryStrategy | None) – Strategy for handling retries in case of failures. Defaults to None, meaning no retries.
- cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. See Cache strategy for more information. Defaults to None.
- model (str | None) – ID of the model to use. Check the Gemini documentation for the list of available models. To specify the model in the UDF call, set it to None in the constructor.
- api_key (str | None) – API key for Gemini API services. Can be provided in the constructor, in __call__, or by setting the GOOGLE_API_KEY environment variable.
- gemini_kwargs – Any other arguments accepted by the Gemini embedding service. Check the Gemini documentation for the list of accepted arguments.
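
The capacity and retry_strategy parameters above can be pictured with a small asyncio sketch. This is not Pathway's implementation: embed_with_limits, fake_embed, max_retries and delay are hypothetical names, the semaphore and exponential backoff are assumed stand-ins for what the library does internally, and fake_embed replaces a real API call.

```python
import asyncio


async def embed_with_limits(texts, embed_one, capacity=None, max_retries=0, delay=0.01):
    """Run embed_one over texts with at most `capacity` calls in flight,
    retrying each failed call up to `max_retries` times."""
    # capacity=None means no limit, so a semaphore is created only when a limit is set.
    semaphore = asyncio.Semaphore(capacity) if capacity is not None else None

    async def call_with_retries(text):
        attempt = 0
        while True:
            try:
                return await embed_one(text)
            except Exception:
                if attempt >= max_retries:
                    raise  # retries exhausted (or disabled): propagate the error
                attempt += 1
                # Assumed policy: exponential backoff between attempts.
                await asyncio.sleep(delay * 2 ** (attempt - 1))

    async def guarded(text):
        if semaphore is None:
            return await call_with_retries(text)
        async with semaphore:
            return await call_with_retries(text)

    return await asyncio.gather(*(guarded(t) for t in texts))


# Hypothetical flaky embedder: fails the first time it sees each text.
stats = {"in_flight": 0, "peak": 0}
seen = set()


async def fake_embed(text):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.01)
        if text not in seen:
            seen.add(text)
            raise RuntimeError("transient failure")
        return [float(len(text))]
    finally:
        stats["in_flight"] -= 1


vectors = asyncio.run(
    embed_with_limits(["a", "bb", "ccc", "dddd"], fake_embed, capacity=2, max_retries=1)
)
```

In this sketch, capacity=2 keeps at most two calls running at once, and max_retries=1 lets every text succeed on its second attempt. With max_retries=0 the first failure would propagate, which mirrors the "no retries" default described above.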

Example:

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.GeminiEmbedder(model="models/text-embedding-004")
t = pw.debug.table_from_markdown('''
txt
Text
''')
t.select(ret=embedder(pw.this.txt))

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.GeminiEmbedder()
t = pw.debug.table_from_markdown('''
txt  | model
Text | models/embedding-001
''')
t.select(ret=embedder(pw.this.txt, model=pw.this.model))

__call__(input, *args, **kwargs)

Embeds texts in a Column.

Parameters
- input (ColumnExpression[str]) – Column with texts to embed

get_embedding_dimension(**kwargs)

Computes the number of the embedder’s dimensions by asking the embedder to embed ".".

Parameters
- **kwargs – parameters of the embedder; if unset, the defaults from the constructor are used.
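
The probing approach described above can be sketched in a few lines. The toy_embedder below is a hypothetical stand-in for a real embedding service, and its dim argument is an assumed illustration of a parameter that changes the output size; only the idea of embedding "." and measuring the result comes from the description.

```python
def toy_embedder(text, dim=8):
    """Hypothetical stand-in for a real embedder: returns a fixed-length vector."""
    # Spread character codes over `dim` buckets. The values are meaningless;
    # only the vector length matters for this illustration.
    vector = [0.0] * dim
    for i, ch in enumerate(text):
        vector[i % dim] += ord(ch)
    return vector


def get_embedding_dimension(embedder, **kwargs):
    """Embed the probe string "." and report the length of the result."""
    return len(embedder(".", **kwargs))


default_dim = get_embedding_dimension(toy_embedder)         # uses the default dim
custom_dim = get_embedding_dimension(toy_embedder, dim=16)  # per-call override
```

Because the dimension is measured on an actual embedding, any keyword argument that changes the output size is reflected in the result.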

class LiteLLMEmbedder(*, capacity=None, retry_strategy=None, cache_strategy=None, model=None, **llmlite_kwargs)

Pathway wrapper for litellm.embedding.

The model has to be specified either in the constructor or in each application; no default is provided. The capacity, retry_strategy and cache_strategy need to be specified during object construction. All other arguments can be overridden during application.

Parameters
- capacity (int | None) – Maximum number of concurrent operations allowed. Defaults to None, indicating no specific limit.
- retry_strategy (AsyncRetryStrategy | None) – Strategy for handling retries in case of failures. Defaults to None, meaning no retries.
- cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. See Cache strategy for more information. Defaults to None.
- model (str | None) – The embedding model to use.
- timeout – The timeout for the API call. Defaults to 10 minutes.
- litellm_call_id – The call ID for litellm logging.
- litellm_logging_obj – The litellm logging object.
- logger_fn – The logger function.
- api_base – Optional. The base URL for the API.
- api_version – Optional. The version of the API.
- api_key – Optional. The API key to use.
- api_type – Optional. The type of the API.
- custom_llm_provider – The custom llm provider.

Apart from capacity, retry_strategy and cache_strategy, any argument can be provided either to the constructor or in the UDF call. To specify the model in the UDF call, set it to None in the constructor.
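
One way to picture how constructor arguments and per-call arguments combine is the following sketch. ToyEmbedder and its resolve method are hypothetical and are not part of Pathway; the rule that per-call values take precedence over constructor values is an assumption drawn from the description above.

```python
class ToyEmbedder:
    """Hypothetical embedder that merges constructor defaults with per-call arguments."""

    def __init__(self, model=None, **default_kwargs):
        # Everything given here acts as a default for later calls.
        self.defaults = {"model": model, **default_kwargs}

    def resolve(self, **call_kwargs):
        # Per-call values win; constructor values fill in the rest.
        merged = {**self.defaults, **call_kwargs}
        if merged.get("model") is None:
            raise ValueError("model must be set in the constructor or in the call")
        return merged


fixed = ToyEmbedder(model="text-embedding-3-small", timeout=30)
per_call = ToyEmbedder()  # model left as None, so each call must supply it

fixed_args = fixed.resolve()
per_call_args = per_call.resolve(model="text-embedding-3-small")
```

Leaving the model as None in the constructor is what makes a per-row model column possible, as in the second example below.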

Example:

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.LiteLLMEmbedder(model="text-embedding-3-small")
t = pw.debug.table_from_markdown('''
txt
Text
''')
t.select(ret=embedder(pw.this.txt))

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.LiteLLMEmbedder()
t = pw.debug.table_from_markdown('''
txt  | model
Text | text-embedding-3-small
''')
t.select(ret=embedder(pw.this.txt, model=pw.this.model))

__call__(input, *args, **kwargs)

Embeds texts in a Column.

Parameters
- input (ColumnExpression[str]) – Column with texts to embed

get_embedding_dimension(**kwargs)

Computes the number of the embedder’s dimensions by asking the embedder to embed ".".

Parameters
- **kwargs – parameters of the embedder; if unset, the defaults from the constructor are used.

class OpenAIEmbedder(*, capacity=None, retry_strategy=None, cache_strategy=None, model='text-embedding-3-small', truncation_keep_strategy='start', **openai_kwargs)

Pathway wrapper for OpenAI Embedding services.

The capacity, retry_strategy and cache_strategy need to be specified during object construction, and API key must be provided to the constructor with the api_key argument or set in the OPENAI_API_KEY environment variable. All other arguments can be overridden during application.

Parameters
- capacity (int | None) – Maximum number of concurrent operations allowed. Defaults to None, indicating no specific limit.
- retry_strategy (AsyncRetryStrategy | None) – Strategy for handling retries in case of failures. Defaults to None, meaning no retries.
- cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. See Cache strategy for more information. Defaults to None.
- model (str | None) – ID of the model to use. You can use the List models API to see all of your available models, or see Model overview for descriptions of them.
- api_key – API key to be used for API calls to OpenAI. It must be either provided in the constructor or set in the OPENAI_API_KEY environment variable.
- truncation_keep_strategy (Optional[Literal['start', 'end']]) – Which part of the text to keep when truncation is necessary. Can be "start", "end" or None. If set, only documents longer than the model’s supported context are truncated: "start" keeps the first part of the text and removes the rest, "end" keeps the last part. If None, no truncation is applied to any document, which may cause API exceptions.
- encoding_format – The format to return the embeddings in. Can be either float or base64.
- user – A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse.
- extra_headers – Send extra headers
- extra_query – Add additional query parameters to the request
- extra_body – Add additional JSON properties to the request
- timeout – Timeout for requests, in seconds

Apart from capacity, retry_strategy and cache_strategy, any argument can be provided either to the constructor or in the UDF call. To specify the model in the UDF call, set it to None in the constructor.
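
The API key lookup described for api_key can be sketched as follows. The helper resolve_api_key is hypothetical, and the precedence shown (an explicit argument first, then the OPENAI_API_KEY environment variable) is an assumption about the lookup order rather than a statement of how the library resolves it.

```python
import os


def resolve_api_key(api_key=None, env_var="OPENAI_API_KEY"):
    """Return the explicit key if given, otherwise fall back to the environment."""
    key = api_key if api_key is not None else os.environ.get(env_var)
    if key is None:
        raise ValueError(f"provide api_key or set the {env_var} environment variable")
    return key


# An explicit key (a placeholder value here) is used as-is.
explicit_key = resolve_api_key(api_key="sk-placeholder")
```

Failing early with a clear message when neither source provides a key is usually easier to debug than an authentication error from the first API call.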

Example:

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.OpenAIEmbedder(model="text-embedding-3-small")
t = pw.debug.table_from_markdown('''
txt
Text
''')
t.select(ret=embedder(pw.this.txt))

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.OpenAIEmbedder()
t = pw.debug.table_from_markdown('''
txt  | model
Text | text-embedding-3-small
''')
t.select(ret=embedder(pw.this.txt, model=pw.this.model))

__call__(input, *args, **kwargs)

Embeds texts in a Column.

Parameters
- input (ColumnExpression[str]) – Column with texts to embed

get_embedding_dimension(**kwargs)

Computes the number of the embedder’s dimensions by asking the embedder to embed ".".

Parameters
- **kwargs – parameters of the embedder; if unset, the defaults from the constructor are used.

static truncate_context(model, text, strategy)

Truncates the given text if necessary, from the end or from the start. "strategy" determines which part of the text is kept.
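
The effect of the two strategies can be illustrated with a simplified sketch. It is not the library's implementation: truncate_text is a hypothetical helper, max_tokens is passed explicitly instead of being derived from the model, and whitespace-separated words are an assumed stand-in for the model's real tokenizer.

```python
def truncate_text(text, max_tokens, strategy):
    """Keep at most max_tokens tokens from the start or the end of text."""
    # Assumption: whitespace-separated words play the role of model tokens.
    tokens = text.split()
    if strategy is None or len(tokens) <= max_tokens:
        return text  # nothing to do: truncation disabled or text already fits
    if strategy == "start":
        kept = tokens[:max_tokens]  # keep the first part, drop the rest
    elif strategy == "end":
        kept = tokens[len(tokens) - max_tokens:]  # keep the last part
    else:
        raise ValueError('strategy must be "start", "end" or None')
    return " ".join(kept)


text = "one two three four five"
kept_start = truncate_text(text, 3, "start")
kept_end = truncate_text(text, 3, "end")
untouched = truncate_text(text, 3, None)
```

Here kept_start holds the first three words, kept_end the last three, and untouched is the full text, which in the real setting is the case that may exceed the model's context and cause an API exception.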

class SentenceTransformerEmbedder(model, call_kwargs={}, device='cpu', batch_size=1024, **sentencetransformer_kwargs)

Pathway wrapper for Sentence-Transformers embedder.

Parameters
- model (str) – model name or path
- call_kwargs (dict) – kwargs that will be passed to each call of encode. These can be overridden during each application. For possible arguments check the Sentence-Transformers documentation.
- device (str) – defines which device will be used to run the Pipeline
- batch_size (int) – maximum size of a single batch to be sent to the embedder. Bigger batches may reduce the time needed for embedding, especially on GPU.
- sentencetransformer_kwargs – kwargs accepted during initialization of SentenceTransformers. For possible arguments check the Sentence-Transformers documentation.
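
The roles of batch_size and call_kwargs can be sketched without loading a model. Everything in this block is hypothetical: toy_encode stands in for the model's encode method, its upper flag is an invented keyword used only to show call_kwargs being forwarded, and embed_in_batches is an assumed illustration of batching rather than the wrapper's actual code.

```python
def batched(texts, batch_size):
    """Yield consecutive slices of texts, each holding at most batch_size items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_in_batches(texts, encode, batch_size=1024, **call_kwargs):
    """Encode texts batch by batch, forwarding call_kwargs to every encode call."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(batch, **call_kwargs))
    return vectors


batch_sizes = []  # records how many texts each encode call received


def toy_encode(batch, upper=False):
    """Hypothetical encoder: one single-value vector per text."""
    batch_sizes.append(len(batch))
    return [[float(len(t.upper() if upper else t))] for t in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(texts, toy_encode, batch_size=2, upper=True)
```

With five texts and batch_size=2, the encoder is called three times with batches of 2, 2 and 1 texts. Larger batches mean fewer calls, which is where the speedup on GPU mentioned above comes from.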

Example:

import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.SentenceTransformerEmbedder(model="intfloat/e5-large-v2")
t = pw.debug.table_from_markdown('''
txt
Text
''')
t.select(ret=embedder(pw.this.txt))

__call__(input, *args, **kwargs)

Embeds texts in a Column.

Parameters
- input (ColumnExpression[str]) – Column with texts to embed

get_embedding_dimension(**kwargs)

Computes the number of the embedder’s dimensions by asking the embedder to embed ".".

Parameters
- **kwargs – parameters of the embedder; if unset, the defaults from the constructor are used.