Welcome to Data week! Today’s issue of Crafting Tech Teams is one of many in a weekly series discussing event modeling and event sourcing.
Inspired by the book Designing Data-Intensive Applications by Martin Kleppmann, we will be looking into the data-processing and data-modelling aspects of the book to draw some parallels with modern tools.
Unix stdin and stdout are powerful abstractions
A hallmark of software design: the Unix shell’s ability to pipe commands together into a pipeline and have them take care of the nitty-gritty of buffering data between each other is a marvellous invention!
As the book also highlights, this isn’t too far from data processing pipelines that we encounter in modern OLTP and OLAP databases.
Write against a common, simple interface that is easy to encode to and decode from. “Stringly typed” comes to mind. This is a very natural interface over which to perform a MapReduce operation. It also highlights a very important distinction: not all MapReduce operations need to be distributed.
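To make this concrete, here is a minimal, single-machine sketch in Python: a MapReduce-style word count that reads text from stdin and writes text to stdout, so it can sit anywhere in a Unix pipeline. The script name in the usage comment is hypothetical.

```python
# A MapReduce-style word count over the Unix text interface.
# Hypothetical usage:  cat access.log | python word_count.py | head
import sys
from collections import Counter

def map_phase(lines):
    # Map: emit one token per word on each input line.
    for line in lines:
        for word in line.strip().split():
            yield word

def reduce_phase(words):
    # Reduce: count occurrences of each word.
    return Counter(words)

if __name__ == "__main__":
    counts = reduce_phase(map_phase(sys.stdin))
    for word, count in counts.most_common():
        sys.stdout.write(f"{count}\t{word}\n")
```

No cluster required: the same map and reduce steps only become a distributed job once the input no longer fits on one machine.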
The Evolution of MapReduce
I think you’d be hard-pressed to find someone on your team who doesn’t know about Map and Reduce. However, Google’s technology with the concatenated name, MapReduce, might raise some question marks. Google set out to tackle an enormous limitation in the early 2000s: using commoditised hardware at a much larger scale than was possible at the time with expensive server hardware.
To cope with the increased rate of failures, the machines would perform network-coordinated data manipulation on a unified file system that the entire cluster shared.
Just like stdin and stdout. But Google-scale.
You can find the MapReduce white paper, presented at OSDI 2004, at this link. MapReduce with Hadoop was the hallmark of OLAP processing in the early 2010s, before the advent of reliable columnar databases and log-based message queues.
Applications are Glorified Database Gatekeepers
Break your app down into features, contexts, behaviors, classes: in the end it all comes down to data processes and data flow.
Rather than designing specs for databases and data schemas, build an understanding of the flow of data first. You will notice this theme throughout this week’s articles: tools and design processes related to understanding data in its flow and movement.
Most engineering practices advocate a static, analysed approach to database schemas, one that speaks of the data in a rigid, single point-in-time manner and sends engineers practising continuous delivery running for the hills.
It doesn’t make data analysts’ lives any easier either. Understand what the data computation will look like. Then work out how to get the data to where the compute is.
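As a rough illustration of flow-first thinking, here is a small Python sketch where the pipeline of transformations is the spec and any storage schema would be derived from it afterwards. The record fields are invented for the example.

```python
# Describe the flow of data first; the schema falls out of what the steps need.
def parse(raw_lines):
    # Raw CSV-ish lines become records with just the fields the flow requires.
    for line in raw_lines:
        user_id, amount = line.split(",")
        yield {"user_id": user_id, "amount": float(amount)}

def only_large_orders(records, threshold=100.0):
    # Filter step: the business rule lives in the flow, not in the table design.
    return (r for r in records if r["amount"] >= threshold)

def revenue_per_user(records):
    # Aggregation step: this is the computation the storage must ultimately serve.
    totals = {}
    for r in records:
        totals[r["user_id"]] = totals.get(r["user_id"], 0.0) + r["amount"]
    return totals

raw = ["alice,250.0", "bob,40.0", "alice,90.0"]
print(revenue_per_user(only_large_orders(parse(raw))))  # {'alice': 250.0}
```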
The Importance of Locality
Modern ORMs have become a contentious topic in DB-to-OO modeling practices. Some teams hate them. Some teams love them. We won’t go into too much detail on that in this article. We will, however, look at one particular point that both camps seem to agree is a problem: joins.
Joins are notoriously the most-optimised type of lookup in modern ORMs. For them to be fast, they should be computed on one local, logical machine as much as possible. The tendency to shift towards document-based (NoSQL) databases comes from boundaries where that is no longer possible or favorable.
This brings us to another recurring theme of this week: keep the data and the computation close together. When a complex join is performed via an ORM and the database is well capable of dealing with it, engineers on your team should have every right to optimise it to lean on the database’s capabilities.
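As a sketch of what that looks like in practice, the snippet below uses Python’s built-in sqlite3 module to express the join and aggregation in SQL, so the engine computes it next to the data instead of the application fetching both tables and joining them in memory. The tables and columns are illustrative, not from any particular codebase.

```python
import sqlite3

# An in-memory database stands in for whatever engine already holds the data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# The join and the aggregation run inside the database, next to the data;
# only the small result set crosses the boundary into the application.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Ada', 124.0), ('Grace', 40.0)]
```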
This nuance shines a light on why Event Sourcing has been so popular lately: it’s very easy to move derivatives of the source data around, optimising for many specific use cases. For one thing, you always know which database holds the original source data. That’s a huge benefit in and of itself.
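A minimal sketch of that idea, with hypothetical event shapes: the event log is the source of truth, and the derived read model below can be rebuilt, or rebuilt somewhere else entirely, whenever a new use case calls for it.

```python
# The append-only event log is the original source data.
events = [
    {"type": "OrderPlaced", "order_id": 1, "total": 99.0},
    {"type": "OrderPlaced", "order_id": 2, "total": 40.0},
    {"type": "OrderCancelled", "order_id": 2},
]

def project_open_order_revenue(event_log):
    # A derived view optimised for one use case; throw it away and replay
    # the log to build a different one.
    open_orders = {}
    for event in event_log:
        if event["type"] == "OrderPlaced":
            open_orders[event["order_id"]] = event["total"]
        elif event["type"] == "OrderCancelled":
            open_orders.pop(event["order_id"], None)
    return sum(open_orders.values())

print(project_open_order_revenue(events))  # 99.0
```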
The Sushi Principle: a schema-on-read approach
Centralising an “agreed-upon” data schema can slow down even the most productive business. When the data is used by different teams with different goals, they may opt to have their own schema to optimise for their own needs. Pushing a change upstream to the central warehouse, updating its schema (and the data!) via software architects and onwards to DBAs, Data Engineers and Data Scientists, can take years even in small companies.
Instead of a standardised format, a “raw data, quickly” strategy centralises only the location of the data; reading it and deriving a common schema from it becomes a consumer-side problem, handled on demand.
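Here is a small sketch of schema-on-read, assuming JSON-lines records and made-up field names: the raw data is stored untouched, and each consuming team applies only the schema it needs at read time.

```python
import json

# Raw records land in the lake exactly as produced, warts and all.
raw_records = [
    '{"user": "alice", "amount": "250.0", "country": "DE"}',
    '{"user": "bob", "amount": "40.0"}',
]

def read_for_finance(lines):
    # Finance cares about amounts and coerces them to numbers on read.
    for line in lines:
        record = json.loads(line)
        yield {"user": record["user"], "amount": float(record["amount"])}

def read_for_marketing(lines):
    # Marketing cares about countries and tolerates missing values.
    for line in lines:
        record = json.loads(line)
        yield {"user": record["user"], "country": record.get("country", "unknown")}

print(list(read_for_finance(raw_records)))
print(list(read_for_marketing(raw_records)))
```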
Warehouses, Data Cubes and Materialized Views
If you don’t want to run MapReduce for each job, you might try your luck just putting the data into an enormous database and querying it. After all, that’s what databases are for, at least in the OLAP sense.
Also known as “Big F**** Tables”, Data Cubes come from the era of large Excel pivot tables. For many companies, the only way to perform rich analyses of their data is to aggregate all the relevant data into one type and flavour of database. There are many flavours and families: Aurora, Snowflake, ClickHouse, etc.
A data warehouse is an attempt to bring centralised analytical schemas to all database sources within an organization. This has a downside, however: all data is still bound to a fixed schema and needs to be structured and transformed appropriately before it can be ingested.
Once ingested, it also has to be deduplicated as multiple data sources could contain overlapping (but not exclusively redundant) data points.
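A toy sketch of that deduplication step, with hypothetical sources and “email” as an illustrative business key:

```python
# Two sources report overlapping customer records; ingestion keeps one row per key.
crm_rows = [{"email": "ada@example.com", "name": "Ada"}]
billing_rows = [
    {"email": "ada@example.com", "name": "Ada Lovelace"},
    {"email": "grace@example.com", "name": "Grace"},
]

def deduplicate(*sources, key="email"):
    merged = {}
    for source in sources:
        for row in source:
            # Later sources win here; a real pipeline would apply survivorship rules.
            merged[row[key]] = row
    return list(merged.values())

print(deduplicate(crm_rows, billing_rows))
# [{'email': 'ada@example.com', 'name': 'Ada Lovelace'},
#  {'email': 'grace@example.com', 'name': 'Grace'}]
```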
Data Cubes are a subset of Data Warehouse capabilities that aim to centralise the key factors of a particular domain (like customer purchasing history) and break them down across all their relevant dimensions. A dimension is something by which you can filter, query or sort.
However, the number of cells in the cube is the product of the cardinalities of all its dimensions (the size of their Cartesian product). This produces a rather glamorous blob of mostly empty data cells that can quickly outgrow the number of atoms in the known universe. This is especially a problem when certain dimensions are unbounded or contain sparse, chaotic data.
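A quick back-of-the-envelope calculation shows how fast that happens; the dimensions and cardinalities below are invented for illustration.

```python
from math import prod

# Cell count is the product of the dimension cardinalities,
# no matter how few cells ever hold real data.
dimensions = {
    "customer": 1_000_000,
    "product": 50_000,
    "day": 3_650,   # ten years at daily granularity
    "store": 2_000,
}

cells = prod(dimensions.values())
print(f"{cells:,} potential cells")  # 365,000,000,000,000,000 potential cells
```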
Data Lakes
Data Engineers and Scientists are notorious for having poor software engineering practices. Likewise, software engineering teams are known to be rather pragmatic with databases, to put it mildly.
To avoid the problems that arose with Data Warehouses, schemas and deduplication (and the queries still run slowly!), Data Lakes were imagined as a solution where the data doesn’t need to be clean: it just needs to be in the same place.
If Data Warehouses are your accountant’s invoices and tax returns, then Data Lakes are the post office. Send it away and as long as it has a reasonable address they’ll categorise it somehow.
The limiting factor for very large databases is hardware I/O. The only reasonable way to solve that is to split the database across many machines and use all of them in parallel: bring the compute to where the data is.
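A minimal sketch of that idea on one machine: each partition is aggregated locally, standing in for a worker that sits next to its shard of the data, and only the small partial results travel to be combined.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Each list stands in for the shard of data that lives on one machine.
partitions = [
    ["click", "view", "click"],    # machine A's shard
    ["view", "view", "purchase"],  # machine B's shard
]

def local_aggregate(partition):
    # Runs next to the data; returns a small summary, not the raw rows.
    return Counter(partition)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(local_aggregate, partitions))
    combined = sum(partials, Counter())
    print(combined)  # Counter({'view': 3, 'click': 2, 'purchase': 1})
```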
This is part of a series on Event Sourcing and Modeling while reviewing the book Designing Data-Intensive Applications by Martin Kleppmann. Share it with your coworkers.