Engineers want the ergonomics of SQL query languages. So why do NoSQL databases exist?
SQL is added to everything from Kafka, Clickhouse, Elasticsearch to Log monitoring and parsing JSON. On top of that, SQL RDBMS systems race towards document column types, cloud clustering and vectorDB
‼️ Before we begin, the inspiration for today’s post is today’s upcoming stream episode where we’ll discuss the place of NoSQL in modern UI-based applications.
Why do we love SQL?
SQL is the most widely known and simplest declarative language in use across all tech industries. Engineers don’t have to deal with importing libraries, the bells and whistles of looping and exception handling.
Instead, once a query is written, it describes the output you want and the general path of how to get there as a hint to the query planner. The rest is on the engineer to transform the data and make good use of it.
However, it is extremely rare to use SQL query results in the form they are retrieved.
What the query looks like…
select concat (c.cust_first, ' ', c.cust_last) as name, c.contact_mail as email, a.street from cust_x_01 c, cust_x_01_addr a
where a.cust_id = c.base_num
What the application really needed…
{
"customer": {
"name": "John Doe",
"email": "john.doe@example.com",
"addresses": [
{
"street": ...
}
]
}
}
This is a code example from https://stepzen.com/blog/accessing-sql-nosql-databases-with-graphql-api —an article not about databases per se, but rather about the query language and how GraphQL bridges the gaps. More on that in another article.
What benefits do Document stores provide?
On the usage end of data storage, there’s the extremes of DynamoDB and ElasticSearch. Yes, it’s a NoSQL database with a web interface—Go figure!
Document stores are closer to what you would consider modern usage in web and mobile UIs. A JSON text or binary format, a vector store or a simple hash map of values.
MongoDB was among the first to offer this capability as a core feature. However, most NoSQL databases have some form of SQL language support. And most SQL RDBMS’ have document/json data types for columns. So what’s really going on here?
What the heck?
NoSQL has never been about not having SQL. Starting from MongoDB’s example the primary capability of NoSQL databases is the ability to cluster data.
Why are NoSQL dbs good at clustering data?
Because they don’t encourage (some completely prevent) storing data in tables. Especially with sequences primary keys. It’s the sorted nature of monotonic surrogate keys that make partitioning and sharding so notoriously difficult for most Relational DBMS.
Want to learn more on surrogate, sythnetic keys and this slide jogs your interest?
Then head over to Nuri Halperin’s talk. Video link below.
NoSQL
Although the term NoSQL came into mainstream use in the 2010s, most document and key/value stores that had clustering capabilities or write-ahead logs predate it:
BerkeleyDB 1994
memcached 2003
Apache Lucene + Solr 2004
CouchDB 2005
MemcacheDB 2008
Redis 2009
MongoDB 2009
SQL RDBMS’ Struggle with “Baggage”
Under the hood SQL RDBMS are a work of art when it comes to smart disk usage, query planning and transactional workloads. They are amazing universal tools.
It’s the very specific usage scenarios that they struggle with:
🤖 Connections pools (100+ can already pose a serious problem)
⌛️ Time series data
💨 Ephemeral data
🏃♂️ Temporary, strong data (one-off materialized views)
⏱ Data entries with TTLs, expiration or lifecycles
🪜 State machines (Step-data)
🪫 Releasing disk space after deleting data
📊 Strict no-mixing of OLTP and OLAP use cases (♥️ clickhouse)
🍇 Clustering
🏠 Locality of data (ie. keeping all of Denis’ data in one place)
👨💻 Exposing Web APIs
Pair that with common performance optimisation tips for SQL databases:
1. Cache your calls
Duh, the cache is going to be a NoSQL database!
2. Query using a covering index
A covering index is when the data you need is included in the index. Again, that’s series data once more which brings us close to NoSQL territory again.
3. Don’t fetch data you do not need
Which is a great tip… as long as you can know what you’ll need in advance of querying… Which in heterogeneous, user-controlled or plain chaotic use cases is impossible. This makes this tip conflict with the caching one above.
The Cloud
SQL databases have been very late to the party on hyper-scale, fully managed, cross-regional service. 2023 saw the advent of services such as Supabase, Planetscale and NeonDB.
Master-master replication was a ridiculously difficult (and expensive!) challenge for most RDBMS, and in most cases it has originated or been retained to enterprise-level paid tiers only (*cough* *cough* Oracle).
Traditionally managed SQL databases are also difficult to wire up to serverless (FaaS) functions for OLTP use cases due to the low connect limits mentioned previously.
Tools like Vitess that solve these problems have only come into mainstream enterprise adoption in the last 4-5 years. RDBMS are over 40 years old!
Learn More
Data culture, storage, design and technologies overlap on many different disciplines. Here are notable thought leaders in this area that I personally learned and got inspiration from.
Nuri Halperin 👉 Design Good Schemas - Get a Better Database: youtube
- 👉
- 👉 Author of THE DynamoDB book
Jonathan Hall 👉 Contributor CouchDB for Go: youtube
Want even more?