pw.xpacks.llm.rerankers

class CrossEncoderReranker(model_name, *, cache_strategy=None, **init_kwargs)

Pointwise Cross encoder reranker module.

Uses the CrossEncoder from the sentence_transformers library. For reference, see the Cross encoders documentation.

  • Parameters
    • model_name (str) – Name of the cross-encoder model to be used.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. See Cache strategy for more information. Defaults to None.

Suggested model: cross-encoder/ms-marco-TinyBERT-L-2-v2

Example:

import pathway as pw
import pandas as pd
from pathway.xpacks.llm import rerankers
reranker = rerankers.CrossEncoderReranker(model_name="cross-encoder/ms-marco-TinyBERT-L-2-v2")
docs = [{"text": "Something"}, {"text": "Something else"}, {"text": "Pathway"}]
df = pd.DataFrame({"docs": docs, "prompt": "query text"})
table = pw.debug.table_from_pandas(df)
table += table.select(
    reranker_scores=reranker(pw.this.docs["text"], pw.this.prompt)
)
table

__call__(doc, query, **kwargs)

Evaluates the doc against the query.

  • Parameters
    • doc (pw.ColumnExpression[str]) – Document or document chunk to be scored.
    • query (pw.ColumnExpression[str]) – User query or prompt that will be used to evaluate relevance of the doc.
    • **kwargs – override for defaults set in the constructor.
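
"Pointwise" here means each (query, doc) pair gets its own score, independent of the other documents. The plain-Python sketch below illustrates that pattern only; it is not the class's implementation. `score_pointwise` and `toy_predict` are hypothetical names, and the word-overlap scorer is a stand-in for a real model.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def score_pointwise(
    query: str,
    docs: Sequence[str],
    predict: Callable[[List[Pair]], Sequence[float]],
) -> List[float]:
    """Score every doc against the query, one (query, doc) pair at a time."""
    pairs = [(query, doc) for doc in docs]
    return [float(score) for score in predict(pairs)]


def toy_predict(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer (assumption): counts words shared by query and doc."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


scores = score_pointwise("pathway reranker", ["Something", "Pathway"], toy_predict)
print(scores)
```

With sentence_transformers installed, `CrossEncoder(model_name).predict` could presumably be passed as `predict` in place of the toy scorer, since it also takes a list of text pairs.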

class EncoderReranker(model_name, *, cache_strategy=None, **init_kwargs)

Pointwise encoder reranker module.

Uses the encoders from the sentence_transformers library. For reference, see the Pretrained models documentation.

  • Parameters
    • model_name (str) – Embedding model to be used.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. See Cache strategy for more information. Defaults to None.

Suggested model: BAAI/bge-large-zh-v1.5

Example:

import pathway as pw
import pandas as pd
from pathway.xpacks.llm import rerankers
reranker = rerankers.EncoderReranker(model_name="BAAI/bge-large-zh-v1.5")
docs = [{"text": "Something"}, {"text": "Something else"}, {"text": "Pathway"}]
df = pd.DataFrame({"docs": docs, "prompt": "query text"})
table = pw.debug.table_from_pandas(df)
table += table.select(
    reranker_scores=reranker(pw.this.docs["text"], pw.this.prompt)
)
table

__call__(doc, query, **kwargs)

Evaluates the doc against the query.

  • Parameters
    • doc (pw.ColumnExpression[str]) – Document or document chunk to be scored.
    • query (pw.ColumnExpression[str]) – User query or prompt that will be used to evaluate relevance of the doc.
    • **kwargs – override for defaults set in the constructor.
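
Encoder-based reranking typically embeds the doc and the query separately and scores the pair by vector similarity. The sketch below shows that idea with hand-written toy vectors in place of real embeddings. Cosine similarity is assumed purely for illustration; the similarity measure this class actually uses is not stated here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-d "embeddings" (made up for this example, not model output).
query_vec = [1.0, 0.0]
doc_vecs = {
    "Pathway": [1.0, 0.0],
    "Something": [0.0, 1.0],
    "Something else": [1.0, 1.0],
}

# One score per doc, each computed independently of the others (pointwise).
scores = {text: cosine_similarity(vec, query_vec) for text, vec in doc_vecs.items()}
print(scores)
```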

class LLMReranker(llm, *, retry_strategy=udfs.ExponentialBackoffRetryStrategy(max_retries=6), cache_strategy=None, use_logit_bias=None)

Pointwise LLM reranking module.

Asks the LLM to rate the relevance of a given doc to a query on a scale from 1 to 5.

  • Parameters
    • llm (BaseChat) – Chat instance to be called during reranking.
    • retry_strategy (AsyncRetryStrategy | None) – Strategy for handling retries in case of failures. Defaults to udfs.ExponentialBackoffRetryStrategy(max_retries=6), as shown in the signature; pass None to disable retries.
    • cache_strategy (CacheStrategy | None) – Defines the caching mechanism. To enable caching, a valid CacheStrategy should be provided. See Cache strategy for more information. Defaults to None.
    • use_logit_bias (bool | None) – Whether to use the logit_bias argument. If None, the reranker checks whether the LLM provider supports logit_bias; set it to True or False to override that check. Defaults to None.

Example:

import pathway as pw
import pandas as pd
from pathway.xpacks.llm import rerankers, llms
chat = llms.OpenAIChat(model="gpt-3.5-turbo")
reranker = rerankers.LLMReranker(chat)
docs = [{"text": "Something"}, {"text": "Something else"}, {"text": "Pathway"}]
df = pd.DataFrame({"docs": docs, "prompt": "query text"})
table = pw.debug.table_from_pandas(df)
table += table.select(
    reranker_scores=reranker(pw.this.docs["text"], pw.this.prompt)
)
table

__call__(doc, query, **kwargs)

Evaluates the doc against the query.

  • Parameters
    • doc (pw.ColumnExpression[str]) – Document or document chunk to be scored.
    • query (pw.ColumnExpression[str]) – User query or prompt that will be used to evaluate relevance of the doc.
    • **kwargs – override for defaults set in the constructor.
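
An LLM-based pointwise rating boils down to two steps: prompt the model for a 1 to 5 rating, then turn the reply into a number. The sketch below illustrates those steps with a stub in place of a real chat model. The prompt wording, `parse_rating`, `rate`, and `fake_chat` are all hypothetical and do not reflect how LLMReranker is implemented.

```python
import re
from typing import Callable

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Rate how relevant the document is to the query on a scale from 1 to 5.\n"
    "Query: {query}\n"
    "Document: {doc}\n"
    "Answer with a single digit."
)


def parse_rating(reply: str, low: int = 1, high: int = 5) -> float:
    """Return the first integer found in the reply if it lies in [low, high]."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no rating found in reply: {reply!r}")
    rating = int(match.group())
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} is outside [{low}, {high}]")
    return float(rating)


def rate(doc: str, query: str, chat: Callable[[str], str]) -> float:
    """Ask the chat callable for a rating and parse its reply."""
    reply = chat(PROMPT_TEMPLATE.format(query=query, doc=doc))
    return parse_rating(reply)


def fake_chat(prompt: str) -> str:
    """Stub standing in for a real chat model; always answers "4"."""
    return "4"


score = rate("Pathway", "query text", fake_chat)
print(score)
```

Validating the range explicitly is a design choice of this sketch: a reply outside 1 to 5 raises an error instead of silently producing a misleading score.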

rerank_topk_filter(docs, scores, k=5)

Applies top-k filtering to docs using the relevance scores.

  • Parameters
    • docs (list[dict[str, str | dict]]) – A column with lists of documents or chunks to rank. Each row in this column is filtered separately.
    • scores (list[float]) – A column with lists of re-ranking scores for chunks.
    • k (int) – The number of documents to keep after filtering.

Example:

import pathway as pw
from pathway.xpacks.llm import rerankers
import pandas as pd
retrieved_docs = [
    {"text": "Something"},
    {"text": "Something else"},
    {"text": "Pathway"},
]
df = pd.DataFrame({"docs": retrieved_docs, "reranker_scores": [1.0, 3.0, 2.0]})
table = pw.debug.table_from_pandas(df)
docs_table = table.reduce(
    doc_list=pw.reducers.tuple(pw.this.docs),
    score_list=pw.reducers.tuple(pw.this.reranker_scores),
)
docs_table = docs_table.select(
    docs_scores_tuple=rerankers.rerank_topk_filter(
        pw.this.doc_list, pw.this.score_list, 2
    )
)
docs_table = docs_table.select(
    doc_list=pw.this.docs_scores_tuple[0],
    score_list=pw.this.docs_scores_tuple[1],
)
pw.debug.compute_and_print(docs_table, include_id=False)
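
To make the filtering semantics concrete, the following plain-Python sketch mimics what the example above computes for a single row. It is an illustration only, not the library's implementation: `topk_filter_sketch` is a hypothetical helper, and it assumes the kept documents come back in descending score order.

```python
from typing import Sequence, Tuple


def topk_filter_sketch(
    docs: Sequence[dict], scores: Sequence[float], k: int = 5
) -> Tuple[tuple, tuple]:
    """Keep the k highest-scoring docs together with their scores."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # Pair each doc with its score, sort highest first (assumed order), cut to k.
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)[:k]
    kept_docs = tuple(doc for doc, _ in ranked)
    kept_scores = tuple(score for _, score in ranked)
    return kept_docs, kept_scores


docs = [{"text": "Something"}, {"text": "Something else"}, {"text": "Pathway"}]
scores = [1.0, 3.0, 2.0]

kept_docs, kept_scores = topk_filter_sketch(docs, scores, k=2)
print(kept_docs)
print(kept_scores)
```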