Strategies to Manage Event Sourcing Disk Space
Worried about too much data? Me too. Come learn, and stop worrying.
Today’s issue of Crafting Tech Teams explores disk space usage in the context of Event Sourcing in complex, data-intensive applications. This is part of a series I’m writing while taking some time off on the beach.
Prevention: Keep your stream lifecycle short
The size of your streams determines the amount of flexibility in your data and storage optimisation strategies.
Large, monolithic streams will create a sluggish event store.
Small, time-boxed streams enable your team to manage aggregates easily. They can be moved and stored individually, which opens up a whole range of curation methods, from tombstoning and archiving to log compaction.
Domain-specific signals of lifecycle changes to look for (a small sketch follows this list):
Aggregates changing owner or exclusive assignment frequently
Write-side of data sets whose read-sides are queried exclusively by time range
Workflows where an entity creates others
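As a rough illustration of time-boxing, here is a minimal TypeScript sketch that puts each period’s events into its own short-lived stream. The naming scheme and function are hypothetical assumptions, not tied to any particular event store.

```typescript
// A minimal sketch: time-boxing stream ids so each period gets its own,
// short-lived stream that can later be archived or tombstoned as a unit.

type StreamId = string;

// Hypothetical naming scheme: <aggregate>-<entityId>-<YYYY-MM>
function timeBoxedStreamId(aggregate: string, entityId: string, at: Date): StreamId {
  const period = `${at.getUTCFullYear()}-${String(at.getUTCMonth() + 1).padStart(2, "0")}`;
  return `${aggregate}-${entityId}-${period}`;
}

// Events for March land in one stream, events for April in the next,
// so no single stream grows without bound.
timeBoxedStreamId("cart", "42", new Date("2023-03-15")); // "cart-42-2023-03"
timeBoxedStreamId("cart", "42", new Date("2023-04-02")); // "cart-42-2023-04"
```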
Curation: Closing the Books
Experienced Event Sourcing architects are likely to include tombstoning strategies in their event modeling designs. Tombstoning refers to closing an Event Stream. Each Event Stream has a name unique to the store and server.
When closed, no further events are allowed onto the stream, rendering it dead. Its projections can also be considered immutable at this point, except for code changes in the projector logic. It’s up to the architects and engineers to strike a balance here, guided by questions like the following (a small sketch follows them):
How long do we need to keep this data?
What Event closes a stream?
How long after tombstoning can we safely delete the data?
Which parts of tombstoned records need to be kept forever?
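To make the closing-the-books idea concrete, here is a minimal TypeScript sketch of an append guard. The event names and in-memory store shape are illustrative assumptions, not a specific event store API.

```typescript
// Minimal sketch: a "closing the books" guard over an in-memory stream.
// Event names and shapes are illustrative assumptions.

interface DomainEvent {
  type: string;
  data?: unknown;
}

const STREAM_CLOSED = "StreamClosed"; // the tombstone / closing event

class EventStream {
  private events: DomainEvent[] = [];

  append(event: DomainEvent): void {
    // Once the closing event has been written, the stream is dead:
    // reject any further appends.
    if (this.events.some((e) => e.type === STREAM_CLOSED)) {
      throw new Error("Stream is closed; no further events are allowed.");
    }
    this.events.push(event);
  }

  close(data?: unknown): void {
    this.append({ type: STREAM_CLOSED, data });
  }
}

const ledger2023 = new EventStream();
ledger2023.append({ type: "InvoicePaid", data: { invoiceId: "A-1" } });
ledger2023.close({ closedAt: "2024-01-01" }); // books are closed for the year
// ledger2023.append({ type: "InvoicePaid" }); // would now throw
```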
Snapshots and Log Compaction
The book Designing Data-Intensive Applications (DDIA) mentions these two strategies for any kind of log-based storage, whether that’s an OLTP database’s replication log, a message broker’s durable buffer, or an event store’s persistence.
Snapshotting is the materialisation of a state projection, which lets us avoid replaying the entire event history. This is helpful for CDC, but it can cause problems in Event Sourcing, notably when the projector code changes. As soon as we start using snapshots as an intermediate cache, we face a split-brain effect: replaying all events can disagree with resuming from a snapshot produced by older projector code.
Snapshots are a good transitional strategy, but they don’t lend themselves to a simple-to-understand, evolving schema.
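As a sketch of that trade-off, here is a hypothetical snapshot cache in TypeScript. The shapes are assumptions for illustration; the projectorVersion field shows one way to detect that a snapshot was built by outdated projector code and must be discarded.

```typescript
// Minimal sketch: rebuilding a projection from an optional snapshot.
// The snapshot records which projector version produced it, so stale
// snapshots can be thrown away instead of causing a split-brain state.

interface DomainEvent {
  type: string;
  amount?: number;
}

interface Snapshot<S> {
  state: S;
  version: number;          // index of the last event folded into the snapshot
  projectorVersion: string; // which projector code produced it
}

const PROJECTOR_VERSION = "balance-v2";

function projectBalance(state: number, event: DomainEvent): number {
  return event.type === "Deposited" ? state + (event.amount ?? 0) : state;
}

function rebuild(events: DomainEvent[], snapshot?: Snapshot<number>): number {
  // Only trust the snapshot if it came from the current projector code.
  const usable = snapshot !== undefined && snapshot.projectorVersion === PROJECTOR_VERSION;
  const state = usable ? snapshot!.state : 0;
  const from = usable ? snapshot!.version + 1 : 0;
  return events.slice(from).reduce(projectBalance, state);
}
```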
Log Compaction is the process of calculating deltas between records N and N+1 and only keeping the differences. Compaction will be familiar to avid users of Redis Streams. If you have ever had to implement a JSON-based webhook with an event-sourced system, you have also encountered this problem: two consecutive events often carry a large amount of duplicate data.
This adds a slight compute cost when replaying the events (though it can sometimes be faster) but saves significantly on disk storage if the events already follow a stable schema. This is common when event sourcing side effects from JSON REST commands, or when following a Mongo or Redis stream via CDC. Think of it as “garbage collecting” duplicate data within an Event Stream.
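To illustrate, here is a naive TypeScript sketch that compacts a stream of flat JSON records down to the fields that changed between record N and N+1. Real log compaction inside a store is far more involved; treat this as a toy under the stated assumptions (flat records, no field deletions).

```typescript
// Naive sketch: compact consecutive flat JSON records down to deltas.
// Only top-level fields are compared; nested objects and deletions are ignored.

type JsonRecord = { [key: string]: unknown };

function compact(records: JsonRecord[]): JsonRecord[] {
  return records.map((current, i) => {
    if (i === 0) return current; // keep the first record in full
    const previous = records[i - 1];
    const delta: JsonRecord = {};
    for (const key of Object.keys(current)) {
      if (current[key] !== previous[key]) delta[key] = current[key];
    }
    return delta;
  });
}

// Two webhook payloads that differ only in "status":
compact([
  { orderId: "A-1", status: "pending", total: 100 },
  { orderId: "A-1", status: "shipped", total: 100 },
]);
// => [{ orderId: "A-1", status: "pending", total: 100 }, { status: "shipped" }]
```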
In Kafka, for example, log compaction runs the other way around: instead of keeping deltas, the broker retains only the most recent record for each key and garbage collects the older ones, so the compacted log converges towards the latest state of the stream rather than its full history. This produces an intermediary, derived state, much like a projection, whose older Event Records are reclaimed whenever disk space needs to be freed.
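If you are on Kafka, compaction is a per-topic setting. Below is a small sketch using the kafkajs client to create a compacted topic; the topic name, broker address, and retention value are illustrative assumptions, and kafkajs is just one of several clients you could use.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "compaction-demo", brokers: ["localhost:9092"] });

async function createCompactedTopic(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  await admin.createTopics({
    topics: [
      {
        topic: "order-state", // hypothetical topic name
        numPartitions: 1,
        replicationFactor: 1,
        configEntries: [
          // Keep only the latest record per key instead of deleting by age.
          { name: "cleanup.policy", value: "compact" },
          // How long tombstones (null-value records) remain visible before removal.
          { name: "delete.retention.ms", value: "86400000" },
        ],
      },
    ],
  });
  await admin.disconnect();
}
```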
Tombstoning
Continuing with the idea of Log Compaction, a stream whose identifier has been marked with a special character (usually a dollar sign $ or a special null) can safely be compacted and garbage collected, essentially removing all of its records from the event store.
This is usually done by a special Event that finalises the lifecycle of the underlying aggregate or entity the stream is backing. These events are obviously domain-specific, so take extra care not to apply them indiscriminately as a blanket solution.
Event modeling and event storming are very helpful meeting facilitation tools to discover these natural boundaries and lifecycles.
An example: creating a stream of events for a new Twitter message and closing the books for editing once the edit window expires, somewhere between 72 hours and 2 weeks, depending on when you are reading this.
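Tying the earlier questions together, here is a hypothetical TypeScript sketch that selects tombstoned streams old enough to be deleted. The shapes and the 14-day window are assumptions for illustration, not a prescription.

```typescript
// Hypothetical sketch: pick streams whose tombstone is older than the
// retention window, leaving everything else untouched.

interface StreamSummary {
  id: string;
  tombstonedAt?: Date; // set once the closing event was written
}

const RETENTION_MS = 14 * 24 * 60 * 60 * 1000; // e.g. a 14-day edit window

function streamsToDelete(streams: StreamSummary[], now: Date): string[] {
  return streams
    .filter((s) => s.tombstonedAt !== undefined)
    .filter((s) => now.getTime() - s.tombstonedAt!.getTime() > RETENTION_MS)
    .map((s) => s.id);
}

// Streams tombstoned more than two weeks ago are safe to drop.
streamsToDelete(
  [
    { id: "tweet-123", tombstonedAt: new Date("2023-01-01") },
    { id: "tweet-456" }, // still open, never eligible
  ],
  new Date("2023-02-01"),
); // => ["tweet-123"]
```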
This is part of a series on Event Sourcing and Modeling while reviewing the book Designing Data-Intensive Applications by Martin Kleppmann. Share it with your coworkers.