Batch processing vs stream processing
Batch processing and stream processing are two distinct approaches to handling data: batch processing handles data in discrete chunks or batches, while stream processing deals with data in real time as it flows continuously through the system. Batch has become the de facto way of processing data, and stream processing remains far less common, even when the data being processed is a data stream coming from Apache Kafka. Why is that? What is so complex about stream processing that most people stick to batch processing?
Why would you care about streaming?
In this article, I will show you that stream processing matters by answering the questions you most likely have about batch processing vs stream processing.
By the end of this article, you shall be convinced that stream processing is more than the future of data processing!
What are batch processing and stream processing anyway?
First, before answering more complex questions, let's define exactly what we refer to by "batch" and "streaming" processing.
Batch processing.
Batch processing consists of collecting data over a period of time and then processing it all at once. Instead of handling each transaction or task individually as soon as it arrives, batch processing collects them into a batch or group and processes them together at a scheduled interval.
In batch processing, the data is called batch data and is assumed to be complete and static: it will not change during the computation. The batch data is split into batches that are then processed sequentially. A batch is a collection of data processed together as a single unit. This grouping allows for more efficient processing, as multiple items can be handled at once rather than individually.
In batch processing, all the data has to be available when the computation starts. If the data is unbounded, only the data available at the start of the computation is considered. This data is treated as complete and processed as a whole, without any notion of time. Whenever new data enters the system, a new processing job has to be run from scratch. Batch processing is commonly used in scenarios where data can be accumulated and processed offline, such as overnight or during periods of low system activity. It is typically used for tasks like large-scale data analysis, reporting, and ETL (Extract, Transform, Load) processes, where it is more efficient to process large amounts of data at once rather than one item at a time.
Batch processing is usually used to periodically compute statistics over the data gathered since the previous computation, such as daily analytics computed overnight.
Stream processing.
Stream processing, also called data stream processing or streaming processing, is the continuous processing of data records as they arrive, in real-time. We say "real-time," but the exact term is "near real-time," as some latency is impossible to avoid. Stream processing aims to deliver the lowest latency possible between the arrival of new data and the update of the system.
In stream processing, the data is called streaming data or data streams and is processed as it arrives, often in small, manageable chunks or streams, without the need to store the entire dataset before processing. This approach enables immediate analysis, response, and decision-making based on the most recent data, making it well-suited for applications like real-time analytics, monitoring, and event-driven processing.
Stream processing is usually used for operational analytics. It allows you to monitor how the system evolves and to trigger alerts in real time. For example, you may want to monitor live metrics on a dashboard during a product launch and adapt your strategy (republish, update the content, etc.) based on what happens. In those use cases, timing is critical.
Comparison: batch processing vs stream processing
The key difference between batch processing and stream processing lies in how they handle data: batch processing operates on finite datasets, processing them in discrete chunks or batches at scheduled intervals, while stream processing deals with data in real time, handling it incrementally as it arrives, which enables immediate analysis and action without having to store the entire dataset beforehand.
| | Batch Processing | Stream Processing |
|---|---|---|
| Nature of the data | Data considered complete, without any notion of time. | Data associated with a notion of time. |
| Data Handling | Processes data in discrete batches at scheduled intervals. | Processes data in real time as it arrives. |
| Processing Time | Typically involves longer processing times due to batch accumulation. | Processes data immediately, resulting in faster processing times. |
| Data Storage | Requires storing the entire dataset before processing. | Does not require storing the entire dataset beforehand. |
| Use Cases | Suitable for periodic analytics. | Ideal for real-time analytics, monitoring, and event-driven processing. |
| Latency | Higher latency, as processing occurs at scheduled intervals. | Lower latency, as processing happens as soon as data arrives. |
| Resource Utilization | May require significant resources during batch processing intervals. | Resource-efficient, using resources as data arrives. |
| Typical use case | Periodic reports for marketing. | Operational analytics. |
Time and consistency
By handling live, real-time data, stream processing introduces a new notion into the computation: time. The arrival time of a data point affects the system: late and out-of-order data points may invalidate previous computations. For example, suppose you want to send an alert whenever you receive 10 or more data points in less than one minute. A few minutes ago, you received 9 data points, so you did not trigger an alert. Now you receive a data point that should have arrived earlier: do you send the alert or not? In stream processing, late data points matter and can invalidate previously computed results. If you redid the same computation over the same data stream but with different arrival times, the results might not be the same.
In batch processing, none of this exists, as the data is assumed to be complete at the start of the computation. Data points are reordered, and late data points are simply not part of the computation: the results are consistent with the data you had at that time.
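To make the scenario above concrete, here is a tiny, framework-free Python sketch (the threshold, window length, and timestamps are made up for illustration): counting events by their event time gives a different answer once a late point is taken into account.

```python
from datetime import datetime, timedelta

THRESHOLD = 10
WINDOW = timedelta(minutes=1)

def should_alert(event_times, now):
    """Count events whose event time falls in the last minute before `now`."""
    recent = [t for t in event_times if now - WINDOW < t <= now]
    return len(recent) >= THRESHOLD

now = datetime(2024, 1, 1, 12, 1, 0)
# 9 events observed so far, all within the last minute: no alert.
events = [datetime(2024, 1, 1, 12, 0, 5) + timedelta(seconds=5 * i) for i in range(9)]
print(should_alert(events, now))   # False

# A 10th event arrives late: its event time is inside the window,
# even though it was received after the first decision was made.
events.append(datetime(2024, 1, 1, 12, 0, 30))
print(should_alert(events, now))   # True: the earlier "no alert" decision is invalidated
```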
Why would I use streaming?
Given the previous definitions of batch and streaming, you could think that what you are currently doing, which is most likely batch, is perfect for your use case: you gather your data in your data lake or warehouse and periodically launch processing on the data you have gathered.
Then, why would you care about stream processing when everything in your project screams "batch"?
Actually, things are not that simple. Batch and streaming are more similar than they look at first. Let's explore and debunk the most common misconceptions about the differences between batch and streaming.
Why would I use streaming with my batch data?
Well… is your data really batch? This is the first question you might ask yourself. Why would you go streaming if your data is not a data stream?
But is your data really static? When you think about it, all the data we interact with is, in essence, a data stream. Data is typically generated and updated over time. The only really static datasets are the ones frozen in time that are shared for reproducibility and Kaggle competitions. There are only unbounded data streams.
Batch processing is simply a way to process large amounts of data at once, as if the data were bounded, while stream processing embraces the inherent unboundedness of the data and processes it live with as little latency as possible. New data stream processing frameworks such as Pathway use incrementality and micro-batching optimizations to avoid redundant computations while providing both high throughput and minimal latency.
In batch processing, new points are gathered before being processed, leading to potentially high latency. On the other hand, stream processing has low latency as new data points are processed upon reception.
Why would I use something as complex as streaming?
Even if your data is a data stream, batch processing seems appealing: it saves you from handling time and consistency concerns. It's because of this simplicity that batch processing is the de facto standard for data processing. Why would you use something as complex as stream processing?
While batch processing may appear simpler at first, this is no longer true in practice. Stream processing frameworks have greatly matured, significantly lowering the learning curve thanks to simplified APIs. All the challenges of stream processing, such as handling late and out-of-order data, are handled by the engine and hidden from the user. Thanks to those new frameworks, stream processing is now as easy as batch processing from the user's perspective.
Time and consistency still matter in stream processing, and those frameworks let you configure consistency, i.e., how to deal with late and out-of-order data, but their default configuration behaves much like batch processing. For instance, the default behavior of Pathway is to return an output in real time, which is what you would get if you processed the received data as a batch job. In particular, Pathway updates its former results whenever a data point arrives late. You can change this so that data points received after a cut-off are ignored, but the default configuration lets you ignore time considerations entirely.
Stream processing even makes some tasks easier by providing temporal operations such as as-of joins and temporal windows. Those are technically possible in batch processing, but at a much greater development cost.
Triggering an alert whenever 10 or more data points are received in less than a minute is easy with a sliding window.
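As a rough illustration, here is what that alert could look like with Pathway's temporal windows. This is a minimal sketch, not a complete pipeline: the `events` table and its timestamp column `t` are assumptions, and while the `windowby`/`pw.temporal.sliding` pattern follows Pathway's documented API, you should adapt the exact names to your own schema and the current documentation.

```python
import pathway as pw
from datetime import timedelta

# `events` is assumed to be a streaming table with a datetime column `t`
# (e.g., read from Kafka or CSV with one of Pathway's input connectors).
counts = events.windowby(
    pw.this.t,
    window=pw.temporal.sliding(hop=timedelta(seconds=10), duration=timedelta(minutes=1)),
).reduce(
    window_start=pw.this._pw_window_start,
    count=pw.reducers.count(),
)

# Keep only the windows that cross the threshold; this table keeps updating as
# data arrives, including when late points land inside an already-emitted window.
alerts = counts.filter(pw.this.count >= 10)
```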
Why would I break my architecture to use streaming?
What if you have already deployed a working batch-processing pipeline? Migrating to stream processing might look like a lot of effort, and your co-workers (or your project manager) won't easily accept a change in the architecture.
Fortunately, streaming architecture is closer to batch than you may think: your system is likely to be halfway there.
- Your data is already a data stream. As previously explained, data sources generate data streams, so you are likely already dealing with data streams. In fact, 75% of all Fortune 500 companies are using Apache Kafka. Yes, people are doing batch processing on data streams. If you are using a regular DBMS, you can easily add a Change Data Capture (CDC) mechanism to send the changes happening in your database to your processing engine as a data stream.
- Streaming does not require an extra layer in your architecture. The days of the lambda architecture are over! With a single data processing framework such as Pathway, you can now handle both batch and streaming. Moreover, existing tools provide APIs in the most commonly used languages, such as Python and SQL. Using Python may simplify your architecture, as Python is one of the most used languages for ETL and the language of AI/ML and data science. It is highly probable that you already have Python data engineers and parts of the pipeline in Python: in that case, using the same language for a larger part of the pipeline makes it easier to develop, test, and maintain.
Switching to streaming has many benefits: it saves you from translating your data streams into batch data when you could process them directly in a streaming fashion. Furthermore, if you already have a hybrid approach to handle both batch and streaming, using a unified framework can only make things easier for you.
Why would you still use batch? It seems quite expensive…
Now, you should be convinced that streaming is relevant and easy. However, the trade-off might still favor batch processing, especially if you already have a deployed batch processing pipeline. If everyone is happy with the current batch processing, why would you change?
The real question is, "Why would you keep using batch?" The main argument for switching to streaming is cost: in the long run, batch processing will cost you more, in both direct and indirect costs.
Why compute the same metrics over the same (sample of) data, again and again?
The basis of batch processing is to periodically run the same computation over all the available data. When a new computation starts, the data that was already processed in the previous run is processed again. Strictly speaking, you could exclude it, but in practice you usually want to include it to correct previous analyses (statistics that turned out wrong because of late data, for example). So you end up processing the same data again and again, which is a waste of resources.
Of course, you could optimize your implementation and only process the relevant data… and this is exactly what streaming is about! Stream processing handles data incrementally and in real time as it arrives: it only reprocesses the relevant data, leaving the rest untouched for better resource usage. Data stream processing frameworks such as Pathway now use advanced optimization techniques, including incrementality and micro-batching, providing better performance than their batch counterparts. Streaming optimizes the computation for you, allowing you to use fewer resources, which can have a significant impact on your cloud bill.
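To see what "only reprocessing the relevant data" means, here is a deliberately simplified Python sketch; it only illustrates the incremental idea, not how Pathway or any particular engine is implemented. A batch-style job recomputes an average over the full history at every run, while an incremental version only folds in the new points.

```python
# Batch-style: every run walks over the entire history again.
def batch_average(all_points):
    return sum(all_points) / len(all_points)          # O(total data) per run

# Incremental-style: keep a small state (count, sum) and update it per new point.
class IncrementalAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, point):                           # O(1) per new data point
        self.count += 1
        self.total += point
        return self.total / self.count

history = [10.0, 12.0, 11.0]
print(batch_average(history + [13.0]))                 # reprocesses everything: 11.5

avg = IncrementalAverage()
for p in history:
    avg.update(p)
print(avg.update(13.0))                                # touches only the new point: 11.5
```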
What is the cost of using outdated/wrong analytics?
The fundamental difference between batch and streaming is how the computations are done: batch relies on periodic "once and for all" computations, while streaming outputs a continuous flow of updates in real time. So what happens between two successive batch computations? Your data may change while your results don't. In other words, your results may be outdated or plain wrong between two batch runs. In general, the whole point of a data pipeline is to serve those results, be they analytics, alerts, or reports, and their quality should be your top priority. Serving results based on outdated or wrong data can be harmful and has a cost for your business.
A simple solution would be to increase the frequency of the processing to make sure the results are based on relatively fresh data. Sure, but how often should you relaunch the computation? The right frequency depends on many factors, such as the business logic or the size of the data, and finding the sweet spot is hard. Stream processing handles this for you by keeping the results up to date whenever data comes in.
Why pick one when you can have both?
A standard scenario is to start with a daily computation and then shorten the time between refreshes more and more because of an increasing data load. Batch processing was a reasonable choice at the beginning, but running this job more and more often, on more and more data, becomes difficult if you stick to batch. The use case has slowly migrated from batch to streaming! You will either end up doing poorly optimized streaming with your batch implementation (see the point above) or switching to a proper streaming tool. Either way, it will be painful. You don't have to wait before switching to a streaming framework: with unified data processing frameworks such as Pathway, the same code can be used for both batch processing and stream processing. Those frameworks rely on powerful engines suitable for both scenarios. You can first use your code to compute your analytics as a batch job, and seamlessly switch to a streaming setup whenever it makes sense, without changing anything in your code.
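As a sketch of what "the same code for both" can look like with Pathway: the transformation logic is written once, and only the connector mode changes. The file path, schema, and column names below are made up for the example; Pathway's CSV connector does expose `mode="static"` and `mode="streaming"`, but check the current documentation for the exact signatures.

```python
import pathway as pw

class EventSchema(pw.Schema):
    user: str
    value: float

def build_pipeline(events: pw.Table) -> pw.Table:
    # The business logic is written once and reused in both modes.
    return events.groupby(pw.this.user).reduce(
        user=pw.this.user,
        total=pw.reducers.sum(pw.this.value),
    )

# Batch: read the directory once, compute the results, and stop.
static_events = pw.io.csv.read("./events/", schema=EventSchema, mode="static")
results = build_pipeline(static_events)

# Streaming: same logic, but the results keep updating as new rows arrive.
# streaming_events = pw.io.csv.read("./events/", schema=EventSchema, mode="streaming")
# results = build_pipeline(streaming_events)

pw.io.csv.write(results, "./totals.csv")
pw.run()
```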
Batch processing vs stream processing: stream processing for the win!
Batch processing has become the de facto standard of data processing because of stream processing's inherent complexity. While streaming is more expressive and adapted to many use cases, people often prefer batch processing's apparent simplicity. Fortunately, the stream processing tools have matured, making stream processing as easy as batch processing.
Unified data processing frameworks such as Pathway allow you to use the same code for batch and streaming. All the complexity, including late data and consistency, is automatically handled and hidden from the user. Those frameworks provide advanced streaming operations, such as temporal windows, while keeping the simplicity of batch processing. This makes the pipelines easier to develop and maintain while optimizing the resources used.
| | Batch Processing | Stream Processing |
|---|---|---|
| Nature of the data | Streaming data | Streaming data |
| Simplicity of use | ✅ | ✅ |
| Ease of deployment | ❌ | ✅ |
| Freshness of the results | ❌ | ✅ |
| Cost | ❌ | ✅ |
Batch processing was only a workaround to avoid the complexity of stream processing. The inconsistency of the results between two computations was a small price to pay to avoid the complex deployment of stream processing. Now that new tools have made stream processing as accessible as batch processing, why would you keep paying this price?