When to go from Batch to Stream Processing
Are your customers waiting days and weeks for their data?
Today’s issue of Crafting Tech Teams is all about stream processing, notably its contrast to the more traditional batch processing. This is part of a series about event sourcing and event modeling, partly inspired by the book Designing Data-Intensive Applications.
Batch vs Streaming
The heavy lifting of large scans sits somewhere between analytics and transaction processing. The output of batch operations and workflows is normally another structure of some kind that is used as an index, or a projection, of the underlying raw data.
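Here is a rough sketch of what I mean (the record shape and field names are invented for illustration): a batch job scans the raw data end to end and emits a compact per-customer projection.

```python
from collections import defaultdict

# Hypothetical raw order records; in a real batch job these would come
# from a full scan of a file dump or a large query result.
raw_orders = [
    {"customer": "alice", "total": 40.0},
    {"customer": "bob", "total": 15.5},
    {"customer": "alice", "total": 9.99},
]

def build_projection(orders):
    """Aggregate raw orders into a per-customer summary (the 'index')."""
    projection = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
    for order in orders:
        summary = projection[order["customer"]]
        summary["order_count"] += 1
        summary["revenue"] += order["total"]
    return dict(projection)

print(build_projection(raw_orders))
# {'alice': {'order_count': 2, 'revenue': 49.99}, 'bob': {'order_count': 1, 'revenue': 15.5}}
```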
Schema considerations
This is especially useful if the raw data gets discarded regularly, because the projection retains a much more structured aggregation with a lower storage footprint.
Batch processing comes with its downsides, though. When performing all-or-nothing operations, the scatter-gather nature of MapReduce-like jobs will create a synchronous hotspot for write access on the system’s databases.
“[However] in practice, it appears that simply making data available quickly— even if it is in a quirky, difficult-to-use, raw format—is often more valuable than trying to decide on the ideal data model up front.”
— Designing Data-Intensive Applications, p. 415
Batching Window Drawbacks
Batching workflows require the input data to be bounded, especially when sorting or grouping is involved (e.g. extracting a repeat customer or address from a list of orders).
Daily batches may run once per day at the end of the day
Weekly batches may run once per week at the end of the week
You can also run weekly batches every day over the last 7 days (a sliding window); see the sketch after this list
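To make the windowing difference concrete, here is a minimal sketch (timestamps and record shape are made up) of a fixed daily window versus a 7-day sliding window over the same orders:

```python
from datetime import date, timedelta

# Hypothetical order records keyed by the day they occurred on.
orders = [
    {"day": date(2023, 5, 1), "total": 10.0},
    {"day": date(2023, 5, 3), "total": 25.0},
    {"day": date(2023, 5, 8), "total": 5.0},
]

def daily_batch(orders, day):
    """Fixed window: only the records that fall on one calendar day."""
    return [o for o in orders if o["day"] == day]

def sliding_week(orders, as_of):
    """Sliding window: everything in the 7 days ending at `as_of`."""
    start = as_of - timedelta(days=7)
    return [o for o in orders if start < o["day"] <= as_of]

print(len(daily_batch(orders, date(2023, 5, 3))))   # 1 record
print(len(sliding_week(orders, date(2023, 5, 8))))  # 2 records
```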
Batching Hotspots
The more you optimise, speed up and parallelise this process, the more the database will be stressed. Often the desired property of a batch workflow is to reduce its total running time while also increasing its frequency to accommodate incremental changes and updates, making it appear more live.
Modern massively parallel processing databases have techniques that handle this, especially on the MLOps and data engineering side with cloud-based scaling.
Combined with an increasing repertoire of data and schema changes to the live system over the years, this creates a hotspot of complexity and expense.
Stream Processing seeks to solve this latter problem.
Stream Processing
The challenges of Unbounded data
Users want to see the results of their actions, especially when intermediary steps depend on useful derivative data from their previous steps. When the only way to get that information is through a long-running batch process, it creates friction. We discussed this in Batching Hotspots.
Evolving from Batching to Streaming
There is a pattern to this data engineering evolution that I like. I encountered it many times throughout my career and the book highlights the correlated terms very well:
Grouping
Batches take a file or query result and split it into records. Streaming starts at the record level, calling the records events, and groups them into streams.
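A minimal way to picture that inversion (the event shape here is invented for illustration): instead of splitting one big input into records, each event arrives on its own and is appended to a stream keyed by some grouping attribute.

```python
from collections import defaultdict

# Streams grouped by customer; each incoming event is appended as it arrives.
streams = defaultdict(list)

def on_event(event):
    """Group individual events into per-customer streams as they come in."""
    streams[event["customer"]].append(event)

on_event({"customer": "alice", "type": "OrderPlaced", "total": 40.0})
on_event({"customer": "alice", "type": "OrderShipped"})
on_event({"customer": "bob", "type": "OrderPlaced", "total": 15.5})

print([e["type"] for e in streams["alice"]])  # ['OrderPlaced', 'OrderShipped']
```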
Polling
A batched data processing system checks for new data using a polling mechanism. In streaming systems the polling overhead becomes very expensive, and listening to live (push) updates is much cheaper.
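The difference, in a deliberately simplified sketch (the `fetch_new_rows` function and the `broker.subscribe` call are placeholders, not a real API):

```python
import time

def polling_consumer(fetch_new_rows, interval_seconds=60):
    """Batch style: repeatedly ask the store whether anything changed."""
    while True:
        rows = fetch_new_rows()          # pays the query cost even when empty
        for row in rows:
            process(row)
        time.sleep(interval_seconds)     # latency floor = polling interval

def push_consumer(broker, topic):
    """Streaming style: the broker hands us each event as it is published."""
    for event in broker.subscribe(topic):  # blocks until an event arrives
        process(event)

def process(record):
    print("processing", record)
```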
Singular Coupling
Network streams that open a file or serve a query often produce a 1-to-1 relation with the processing system in batch mode. However, the most popular streaming solutions for event listening are messaging systems that support one-to-many consumption or even production.
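A toy in-memory broker (not a real library, just a sketch of the shape) makes the one-to-many relation visible: a single published event reaches every subscribed consumer.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a message system with one-to-many delivery."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Every subscriber gets its own copy of the event.
        for handler in self.subscribers[topic]:
            handler(event)

broker = ToyBroker()
broker.subscribe("orders", lambda e: print("billing saw", e))
broker.subscribe("orders", lambda e: print("analytics saw", e))
broker.publish("orders", {"type": "OrderPlaced", "id": 42})
```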
Webhooks
Unique to streams: when a message consumer has a public-facing service interface, we can expose it directly for record-keeping as a webhook. However, this is not the same as a simple REST endpoint. The endpoint serves no purpose other than recording the incoming message in its queue.
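Here is a sketch of that idea using Flask (the framework and route name are my choice for illustration, nothing prescribed): the endpoint does nothing except append the incoming payload to a queue for later consumption.

```python
from queue import Queue
from flask import Flask, request

app = Flask(__name__)
inbox = Queue()  # stands in for the consumer's message queue

@app.route("/webhook", methods=["POST"])
def record_event():
    # No business logic here: the endpoint only records the message.
    inbox.put(request.get_json(force=True))
    return "", 202  # Accepted: processing happens asynchronously elsewhere

if __name__ == "__main__":
    app.run(port=8080)
```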
Downsides of Streaming
Acknowledgements, or ACK messages as they are commonly known, are possibly the hardest hurdle in the mental shift towards data streaming. The queues of message brokers like Kafka, Redis, and even SQL-backed outboxes give a false sense of continuity and ordering for the dataset.
Unfortunately, asynchronous consumption through multiple unreliable brokers—all asynchronous brokers are unreliable!—breaks this promise quickly and engineers are forced to adopt new strategies to deal with this eventual consistency problem.
Most cases are handled by adopting one of these two compromises:
At-most-once delivery (or processing, depending on what you optimise for)
At-least-once delivery (or processing, depending on what you optimise for)
At-least-once delivery has the upside of guaranteeing message arrival via redundancy or the extra durability provided by partitioning. The downside is that this is only possible by retrying aggressively, which can cause messages to be delivered more than once (and possibly out of order in a given context). A data pipeline that has upstream back-pressure towards the producers can experience a variation of the Two Generals’ Problem.
At-most-once delivery generally focuses on dropping messages as fast as possible to ensure continued operation and low latency. This is commonly done by adopting a no-ACK strategy and treating the queue like a Pub/Sub broadcaster that has to be listened to live. The downside is obvious: if you weren’t listening while the message was broadcast, it never reaches you.
Editor’s note: This is of course an oversimplification. Modern brokers with atomic ACK messages can alleviate most—but not all—visible risks.
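In sketch form (the `queue` object is a generic stand-in, not any specific broker API), the compromise largely comes down to when you acknowledge, and an at-least-once consumer additionally needs to be idempotent:

```python
processed_ids = set()  # naive idempotency store; real systems persist this

def at_most_once(queue):
    """Ack before processing: a crash loses the message."""
    message = queue.receive()
    queue.ack(message)        # message is gone even if the next line fails
    handle(message)

def at_least_once(queue):
    """Ack only after processing: a crash causes redelivery, so deduplicate."""
    message = queue.receive()
    if message["id"] not in processed_ids:   # drop duplicates from retries
        handle(message)
        processed_ids.add(message["id"])
    queue.ack(message)

def handle(message):
    print("handling", message["id"])
```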
Event Modeling is the Swiss Army Knife for Streaming Evolution
Modeling all these streams and brokers can easily get out of hand, especially when the data is being used for further OLTP processing rather than just analytics. When automated data lifecycles are necessary alongside a client-facing UI, event modeling becomes a strong contender.
It provides clear documentation, process analysis, and just enough detail to catch the tricky edge cases introduced by stream processing, while making the important elements of a streaming data pipeline visible:
Data sharing
Data dependencies
Processing order
Derivative Projections and UI updates
Commands and webhooks that record events
But most importantly, it helps you think about the evolution of your code. Plan each version you are about to release and visualise how it impacts your data when exposed to Blue/Green or mixed-version scenarios.
I have seen no other tool capable of showing this so simply, without resorting to the kind of deeply technical analysis meant only for engineers.
This is part of a series on Event Sourcing and Modeling while reviewing the book Designing Data-Intensive Applications by Martin Kleppmann.
Learn something new? Share it. Slide me a DM or comment.