Embedders
When storing a document in a vector store, you compute the embedding vector for the text and store the vector with a reference to the original document. You can then compute the embedding of a query and find the embedded documents closest to the query.
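As a rough illustration of the retrieval step described above, the sketch below ranks stored vectors by cosine similarity to a query vector. It is plain Python, not Pathway's vector store implementation. The toy three-dimensional vectors, the document names, and the helper `closest_documents` are all hypothetical; real embedders return vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Each entry keeps a reference to the original document next to its vector.
# The vectors are made up for illustration; an embedder would compute them.
store = [
    ("doc_cats.txt", [0.9, 0.1, 0.0]),
    ("doc_dogs.txt", [0.7, 0.3, 0.1]),
    ("doc_taxes.txt", [0.0, 0.2, 0.9]),
]


def closest_documents(query_vector, store, k=2):
    """Return the references of the k stored documents most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_ref for doc_ref, _ in ranked[:k]]


# The query vector would normally come from embedding the query text.
query_vector = [1.0, 0.0, 0.0]
top = closest_documents(query_vector, store)
```

Cosine similarity is one common choice of distance here; a production vector store would typically use an index rather than sorting every stored vector.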
The following embedding wrappers are available through the Pathway xpack:
- OpenAIEmbedder: Embed text with any of OpenAI's embedding models
- LiteLLMEmbedder: Embed text with any model available through LiteLLM
- SentenceTransformerEmbedder: Embed text with any model available through SentenceTransformers (a.k.a. SBERT), maintained by Hugging Face
- GeminiEmbedder: Embed text with any of Google's available embedding models
OpenAIEmbedder
The default model for OpenAIEmbedder is text-embedding-3-small.
import os
import pathway as pw
from pathway.xpacks.llm.parsers import UnstructuredParser
from pathway.xpacks.llm.embedders import OpenAIEmbedder
# Read every file in DATA_DIR as raw bytes, watching for changes
files = pw.io.fs.read(
    os.environ.get("DATA_DIR"),
    mode="streaming",
    format="binary",
    autocommit_duration_ms=50,
)
# Parse the documents in the specified directory
parser = UnstructuredParser(chunking_mode="paged")
documents = files.select(elements=parser(pw.this.data))
documents = documents.flatten(pw.this.elements) # flatten list into multiple rows
documents = documents.select(text=pw.this.elements[0], metadata=pw.this.elements[1])
# Embed each page of the document
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
embeddings = documents.select(embedding=embedder(pw.this.text))
LiteLLMEmbedder
The model for LiteLLMEmbedder has to be specified during initialization. No default is provided.
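import os
import pathway as pw
from pathway.xpacks.llm import embedders

# Read the API key from the environment
# (the variable name is an assumption; use the one that matches your provider)
API_KEY = os.environ["OPENAI_API_KEY"]

embedder = embedders.LiteLLMEmbedder(
    model="text-embedding-3-small", api_key=API_KEY
)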
# Create a table with one column for the text to embed
t = pw.debug.table_from_markdown(
"""
text_column
Here is some text
"""
)
res = t.select(ret=embedder(pw.this.text_column))
SentenceTransformerEmbedder
SentenceTransformerEmbedder lets you use models from the Hugging Face Sentence Transformers library.
The model is specified during initialization. Here is a list of available models.
import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.SentenceTransformerEmbedder(model="intfloat/e5-large-v2")
# Create a table with text to embed
t = pw.debug.table_from_markdown('''
txt
Some text to embed
''')
# Compute the embedding of each row's text
t.select(ret=embedder(pw.this.txt))
GeminiEmbedder
GeminiEmbedder is the embedder for Google's Gemini Embedding Services. Available models can be found here.
import pathway as pw
from pathway.xpacks.llm import embedders
embedder = embedders.GeminiEmbedder(model="models/text-embedding-004")
# Create a table with a column for the text to embed
t = pw.debug.table_from_markdown('''
txt
Some text to embed
''')
t.select(ret=embedder(pw.this.txt))