
50× Faster Local Embeddings with Batch UDFs

Szymon Dudycz
Published July 17, 2025 · Updated July 17, 2025

Local embeddings are important for real-time AI applications, from RAG to similarity search. They let you generate vectors on your own infrastructure, keeping latency low and your data private. But if your pipeline embeds data one item at a time, it will struggle to keep up as volume grows. We hit this bottleneck ourselves: embedding documents sequentially was far too slow. Our solution was to introduce batch UDFs, a change that makes local embedding generation 50× faster without sacrificing streaming performance. In this post, we’ll show you how we did it and what it means for your pipelines.

The Local Embeddings Performance Challenge

In production streaming systems, locally computed embeddings pose a unique performance challenge. Unlike cloud embedding APIs (from providers like OpenAI, Cohere, etc.), which batch requests on the server side, local embedding models are typically invoked one item at a time. As a result, much of your hardware’s potential throughput goes unused. When you need to embed thousands of documents per second, sequential processing can grind the pipeline to a halt. This is especially problematic for use cases such as real-time semantic search or retrieval-augmented generation (RAG), where new data must be indexed within seconds to stay useful.

Why are batch UDFs needed?

The User Defined Functions (UDFs) in the Pathway engine are designed to operate on each row separately. For some operations, however, it makes sense to batch the computation, even though the result does not change. A classic example is matrix-by-matrix versus matrix-by-vector multiplication: multiplying two n × n matrices is faster than performing n separate multiplications of an n × n matrix by an n-length vector. So, when tasked with many matrix-by-vector multiplications, it pays off to combine all the vectors into one matrix, do a single matrix multiplication, and then extract the result for each vector.
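To make this concrete, here is a small standalone sketch (using NumPy, purely for illustration and not part of the Pathway codebase) comparing n separate matrix-by-vector multiplications with a single batched matrix multiplication over the same data:

import time

import numpy as np

n = 1024
matrix = np.random.rand(n, n)
vectors = [np.random.rand(n) for _ in range(n)]

# n separate matrix-by-vector multiplications
start = time.time()
one_by_one = [matrix @ v for v in vectors]
print("one by one:", time.time() - start)

# stack the vectors into a single matrix (one vector per column) and multiply once
start = time.time()
stacked = np.stack(vectors, axis=1)
batched = matrix @ stacked
print("batched:", time.time() - start)

# column i of `batched` matches `one_by_one[i]` (up to floating-point rounding)

On typical hardware the batched variant wins by a wide margin, because one large matrix multiplication makes much better use of caches and vectorized instructions than many small ones.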

Since machine learning models lean heavily on matrix operations, they are natural candidates for a batching speedup. While this doesn’t matter for API-based embedding models (they handle batching on the server side), we needed to add batching for locally computed embeddings. In our codebase, this meant enhancing the SentenceTransformerEmbedder (which uses the Sentence Transformers library to generate embeddings) and HFPipelineChat (a wrapper around the Transformers library).
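For reference, both components live in Pathway’s LLM xpack. A minimal sketch of wiring the sentence-transformers embedder into a table of documents might look like the snippet below; the constructor arguments and column names are illustrative, so check the xpack documentation for the exact signatures:

import pandas as pd
import pathway as pw
from pathway.xpacks.llm.embedders import SentenceTransformerEmbedder

# a toy table with a `text` column; in a real pipeline this would come from a connector
docs = pw.debug.table_from_pandas(
    pd.DataFrame({"text": ["local embeddings", "batch UDFs in Pathway"]})
)

# the embedder runs the model locally; with the changes described in this post
# it batches rows internally instead of encoding them one at a time
embedder = SentenceTransformerEmbedder(model="intfloat/e5-large-v2")

result = docs.select(embedding=embedder(pw.this.text))
pw.debug.compute_and_print(result)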

What we did in Pathway

This motivated us to expand our UDFs, which led to the introduction of batch UDFs. When a batch UDF is used, the engine passes multiple rows of data to it at once, with the expectation that the UDF operates on lists of data points. Furthermore, the changes made to support batch UDFs are more general and allow us to optimize other operators that can benefit from batching.

As Pathway is written with time consistency in mind, UDF batching is only possible for rows that share the same processing time. This is not a drawback in practice: when data trickles into the pipeline one point at a time, the extra throughput is not needed anyway.

Benchmarking performance of local embeddings

We ran two tests to measure how batch UDFs accelerate local embedding generation:

  1. 1000 sentences – embedding 1,000 diverse sentences generated by an LLM.
  2. 575 Wikipedia articles – embedding a collection of Wikipedia articles (about 3 million tokens in total).

Each test was run using the intfloat/e5-large-v2 model with three settings: no batching, a batch size of 32, and a batch size of 1024, which is large enough to process all items in a single batch (1,000 sentences or 575 articles). The table below shows the total embedding time in seconds for each scenario:

                     without batching   batches of size 32   batches of size 1024
Wikipedia articles   647.716            342.236              239.721
1000 sentences       2536.602           130.530              43.608

As you can see, the improvement is immense, especially for the sentences, which are mostly consistent in length: without batching, the computation takes over 50 times longer than with batches of size 1024.
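The gap comes from the underlying model call itself. You can observe the same effect outside Pathway with a rough sketch like the one below (this is not the exact benchmark harness used above, just an illustration of per-item versus batched encoding with the same model):

import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")
sentences = [f"example sentence number {i}" for i in range(1000)]

# per-item encoding, roughly what a non-batched UDF does
start = time.time()
for sentence in sentences:
    model.encode(sentence)
print("sequential:", time.time() - start)

# one call over the whole collection, letting the library batch internally
start = time.time()
model.encode(sentences, batch_size=1024)
print("batched:", time.time() - start)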

How to use Batch UDFs

While the introduction of batch UDFs was motivated by their application to RAG pipelines, they are much more general, and you can use them anywhere you expect a performance improvement from batching. To do that, set max_batch_size in the pw.udf decorator:

@pw.udf(max_batch_size=32)
def batched_udf(...):
  ...

or in the UDF constructor:

class BatchedUDF(pw.UDF):
  def __init__(self):
    super().__init__(max_batch_size=32)

  # implementation of the UDF; with batching enabled it receives lists of arguments
  def __wrapped(self, ...):
    ...

and change your UDF function to operate on lists of arguments.
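Putting it together, a batched local-embedding UDF could look like the following sketch. The model choice, column names, and helper table are illustrative; the key point is that with max_batch_size set, each argument arrives as a list covering the whole batch, and the function must return a list with one result per input row:

import pandas as pd
import pathway as pw
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

@pw.udf(max_batch_size=32)
def embed_texts(texts):
    # `texts` is a list holding the values from up to 32 rows that share a
    # processing time; encode them in one call and return one embedding per row
    return model.encode(texts).tolist()

docs = pw.debug.table_from_pandas(
    pd.DataFrame({"text": ["first document", "second document"]})
)
result = docs.select(embedding=embed_texts(pw.this.text))
pw.debug.compute_and_print(result)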

More details on using batch UDFs can be found in the dedicated UDF guide.

