Data is everything. And by extension, so are databases. Here are some fantastic open source options for your next kick-ass project.
For a world dominated so long by database suits like Oracle and SQL Server, there seems to be an endless flurry of solutions now. One part of the reason is innovation fueled by Open Source — really talented developers wanting to scratch an itch and creating something that they can revel in.
The other part is the emergence of new business models, wherein businesses maintain a community version of their product to gain mind share and traction while also providing a commercial, add-on offering.
More databases than one can keep up with. There’s no official stat on this, but I’m pretty sure we have over a hundred options available today if you combine everything from stack-specific object databases to not-so-popular projects from universities.
I know; it frightens me, too. Too many options — too much documentation to go through — and a life that is so short. 🙂
That’s why I decided to write this article, presenting ten of the best databases you can use to improve your solutions, whether building for yourself or others.
Please note: this list isn’t going to contain MySQL, even though it’s arguably the most popular Open Source database solution out there.
Simply because MySQL is everywhere — it’s what everyone learns first, it’s supported by virtually every CMS or framework out there, and it’s very, very good for most use cases. In other words, MySQL doesn’t need to be “discovered.” 🙂
That said, please note that the following aren’t necessarily alternatives to MySQL. In some cases, they might be, while in others, they’re a completely different solution for an entirely different need. Don’t worry, as I’ll be discussing their uses also.
Special note: compatibility
Before we begin, I also must mention that compatibility is something you need to keep in mind. If you have a project that, for whatever reason, supports only a particular database engine, your choices are pretty much shot through.
For instance, if you’re running WordPress, this article is of no use to you. 🙂 Similarly, those running static sites on JAMStack will gain nothing by looking for alternatives too seriously.
It’s up to you to figure out the compatibility equation. However, if you have a blank slate and the architecture is up to you, here are some neat recommendations.
If you’re from the PHP land (WordPress, Magento, Drupal, etc.), then PostgreSQL will sound foreign to you. However, this relational database software has been around since 1997 and is the top choice in communities like Ruby, Python, Go, etc.
In fact, many developers eventually “graduate” to PostgreSQL for the features it offers or simply for stability. It’s hard to convince someone in a short write-up like this but think of PostgreSQL as a thoughtfully engineered product that never lets you down.
There are many good SQL clients available to connect to the PostgreSQL database for administration and development.
PostgreSQL has several fascinating features as compared to other relational databases (specifically, MySQL), such as:
- Built-in data types for Array, Range, UUID, Geolocation, etc.
- Native support for document storage (JSON-style), XML, and key-value storage (Hstore)
- Synchronous and asynchronous replication
- Scriptable in PL, Perl, Python, and more
- Full-text search
My personal favorites are the geolocation engine (which takes away the pain when working with location-based apps — try finding all nearby points manually, and you’ll know what I mean) and support for arrays (many MySQL projects are undone for want of arrays, opting instead for the infamous comma-separated strings).
When to use PostgreSQL
PostgreSQL is always a better choice than any other relational database engine. That is, if you’re starting a new project and have been bitten by MySQL before, it’s a good time to consider PostgreSQL. I have friends who gave up battling MySQL’s mysterious transactional lock failures and moved on permanently. If you decide the same, you won’t be overreacting.
PostgreSQL also has a clear advantage if you need partial NoSQL facilities for a hybrid data model. Since document and key-value storage are natively supported, you don’t need to go hunting for, installing, learning, and maintaining another database solution.
When not using PostgreSQL
PostgreSQL doesn’t make sense when your data model isn’t relational and/or when you have very specific architectural requirements. For instance, consider Analytics, where new reports are constantly being created from existing data. Such systems are read-heavy and suffer when a strict schema is imposed on them. Sure, PostgreSQL has a document storage engine, but things start to fall apart when you’re dealing with large datasets.
In other words, always use PostgreSQL unless you know 100% what you’re doing! 🙂
Check out this SQL & PostgreSQL for Beginners course if interested in learning more.
MariaDB was created as a replacement for MySQL by the same person who developed MySQL.
Well, actually, after MySQL was taken over by Oracle in 2010 (by acquiring Sun Microsystems, which, incidentally, is also how Oracle came to control Java), the creator of MySQL started a new open source project called MariaDB.
Why does all this boring detail matter, you ask? It’s because MariaDB was created from the same code base as that of MySQL (in the open-source world, this is known as “forking” an existing project). As a result, MariaDB is presented as a “drop-in” replacement for MySQL.
That is, if you’re using MySQL and want to migrate to MariaDB, the process is so easy that you just won’t believe it.
Unfortunately, such a migration is a one-way street. Going back from MariaDB to MySQL is not possible, and should you try to use force, permanent database corruption is ensured!
While MariaDB is essentially a clone of MySQL, it’s not strictly true. Ever since the introduction of the database, the differences between the two have been growing. As of writing, adopting MariaDB needs to be a well-thought-through decision on your part. That said, there are plenty of new things going on in MariaDB that may help you make this transition:
- Truly free and open: Since there’s no single corporate entity controlling MariaDB, you can be free of sudden predatory licensing and other worries.
- Several more options of storage engines for specialized needs: for instance, the Spider engine for distributed transactions; ColumnStore for massive data warehousing; the ColumnStore engine for parallel, distributed storage; and many, many more.
- Speed improvements over MySQL, especially due to the Aria storage engine for complex queries.
- Dynamic columns for different rows in a table.
- Better replication capabilities (for example, multi-source replication)
- Several JSON functions
- Virtual columns
. . . And many, many more. It’s exhausting to keep up with all the MariaDB features. 🙂
When to use MariaDB
If you want a true replacement of MySQL, you should use MariaDB as they want to stay on the innovation curve and don’t plan on returning to MySQL again. One excellent use case is the use of new storage engines in MariaDB to compliment your project’s existing relational data model.
When not to use MariaDB
Compatibility with MySQL is the only concern here. That said, it’s becoming less of a problem as projects like WordPress, Joomla, Magento, etc., have started supporting MariaDB. My advice would be not to use MariaDB to trick a CMS that doesn’t support it, as many database-specific tricks will crash the system easily.
The team behind CockroachDB seems to be composed of masochists. With a product name like that, surely they want to turn all odds against them and still win?
Well, not quite.
The idea behind “cockroach” is that it’s an insect built for survival. No matter what happens — predators, floods, eternal darkness, rotting food, bombing, the cockroach finds a way to survive and multiply.
The idea is that the team behind CockroachDB (composed of former Google engineers) was frustrated with the limitations of traditional SQL solutions when it comes to large scale. That’s because, historically, SQL solutions were supposed to be hosted on a single machine (data wasn’t that big). For a long time, there was no way to build a cluster of databases running SQL, which is why MongoDB captured so much attention.
Even when replication and clustering came out in MySQL, PostgreSQL, and MariaDB, it was painful at best. CoackroachDB wants to change that, bringing effortless sharding, clustering, and high availability to the world of SQL.
When to use CockroachDB
CockroachDB is the system architect’s dream come true. If you swear by SQL and have been simmering at the scaling capabilities of MongoDB, you’ll love CockroachDB. Now you can quickly set up a cluster, throw queries at it, and sleep peacefully at night. 🙂
When not to use CockroachDB
Better the devil you know than the one you don’t. By that, I mean, if your existing RDBMS is working well for you and you think you can manage the scaling pains it brings, stick with it. CockroachDB is a new product for all the genius involved, and you don’t want to be struggling against it later on. Another major reason is SQL compatibility — if you’re doing exotic SQL stuff and rely on it for critical things, CockroachDB will present too many edge cases for your liking.
From now on, we’ll consider non-SQL (or NoSQL, as it’s called) database solutions for highly specialized needs.
Searching for a fast, open-source OLAP database system?
Go for ClickHouse.
It uses every hardware to its maximum potential to approach each query faster. The peak performance of processing a query usually remains more than two terabytes each second. To avoid increased latency, reads are balanced automatically among the healthy replicas.
It supports multi-master async replication, and you can deploy it across different data centers. As nodes are maintained equal, you can avoid even single failure points. The downtime of either a single node or the complete data center will never affect system availability in terms of writes and reads.
ClickHouse is very easy to use and simple. It streamlines data processing, puts all your data into a system in an organized manner, and is available instantly to build reports. Moreover, SQL dialect helps to express the result without using any non-standard API, which you can get in alternative systems.
You can rely on this database management system to configure it as a distributed system that is located on separate nodes without failure points. Additionally, its security features are robust, including enterprise-grade security and fail-safe mechanisms in case of human errors.
ClickHouse can process queries faster compared to row-oriented systems having the same CPU capacity and I/O throughput. Its columnar data storage format helps keep more data inside RAM, resulting in shorter response times.
The total ownership cost can be reduced by commodity hardware having rotary disk drives rather than using NVMe/SSD without sacrificing latency for the queries. It strives for CPU efficiency, optimizes access to the disk drive, and minimizes data transfers.
Furthermore, due to its feature-laced SQL database, you can efficiently process your query in no time, join co-located and distributed data, manage denormalized information efficiently, and more. ClickHouse scales horizontally and vertically and adapts easily to perform on a single server or clusters with thousands of nodes.
Use ClickHouse for web and application analytics, telecommunications, ad network, online games, IoT, business intelligence, finance, eCommerce, monitoring, and more.
It integrates with Hadoop, Postgres, and MySQL.
One of the most significant developments in the recent decade is connected data. The world around us is not partitioned into tables and rows and boxes — it’s one giant mess with everything connected to almost everything else.
Social networks are a prime example, and building a similar data model using SQL or even document-based databases is a nightmare.
That’s because the ideal data structure for these solutions is the graph, which is an entirely different beast. And for that, you need a graph database like Neo4j.
The example above was taken directly from the Neo4j website and shows how university students are connected to their departments and courses. Such a data model is plain impossible with SQL, as it’ll be tough to avoid infinite loops and memory overruns.
Graph databases are unique in themselves, and Neo4j is pretty much the only option for working with graphs. As a result, whatever features it has are unique. 🙂
- Support for transactional applications and graph analytics.
- Data transformation abilities for digesting large-scale tabular data into graphs.
- Specialized query language (Cypher) for querying the graph database
- Visualization and discovery features
It’s a moot point to discuss when to use Neo4j and when not. If you need graph-based relationships between your data, you need Neo4j. 🙂
MongoDB was the first non-relational database to make big waves in the tech industry and continues to dominate a fair share of attention.
Unlike relational databases, MongoDB is a “document database,” which stores data in chunks, with related data clumped together in the same chunk. This is best understood by imagining an aggregation of JSON structures like this:
Here, unlike a table-based structure, the user’s contact details and access levels reside inside the same object. Fetching the user object fetches the associated data automatically, and there’s no concept of a join. Here’s a more detailed intro to MongoDB.
MongoDB has some serious (I almost want to write “kick-ass” to convey the impact, but it wouldn’t be proper on a public website, perhaps) features that have made several seasoned architects abandon the relational land forever:
- A flexible schema for specialized/unpredictable use cases.
- Ridiculously simple sharding and clustering. You just need to set up the configuration for a cluster and forget about it.
- Adding or removing a node from a cluster is drop-dead simple.
- Distributed transactional locks. This feature was missing in the earlier versions but was eventually introduced.
- It is optimized for very fast writes, making it highly suitable for analytics data as a caching system.
If I sound like a spokesperson for MongoDB, I apologize, but it’s hard to oversell the advantages of MongoDB. Sure, NoSQL data modeling is weird at first, and some never get the hang of it, but for many architects, it almost always wins out over a table-based schema.
When to use MongoDB
MongoDB is a great crossover bridge from the structured, strict world of SQL to the amorphous, almost confusing one of NoSQL. It excels at developing prototypes, as there’s simply no schema to worry about and when you really need to scale. Yes, you can use a cloud SQL service to get rid of DB scaling issues, but boy, is it expensive!
Finally, there are use cases where SQL-based solutions just won’t do. For instance, if you’re creating a product like Canva, where the user can create arbitrarily complex designs and be able to edit them later, good luck with a relational database!
When not to use MongoDB
The complete lack of schema that MongoDB provides can work as a tar pit for those who don’t know what they’re doing. Data mismatch, dead data, empty fields that should not be empty — all this and much more is possible. MongoDB is essentially a “dumb” data store, and if you choose it, the application code has to take a lot of responsibility for maintaining data integrity.
If you are a developer, then you will find this useful.
As its name goes, RethinkDB “rethinks” the idea and capabilities of a database when it comes to real-time apps.
When a database gets updated, there’s no way for the application to know. The accepted approach is for the app to fire off a notification as soon as there’s an update, which gets pushed to the front-end through a complex bridge (PHP -> Redis -> Node -> Socket.io is one example).
But what if the updates could be pushed directly from the database to the front-end?!
Yes, that’s the promise of RethinkDB. So if you’re on to making a true real-time application (game, marketplace, analytics, etc.), Rethink DB is worth a look.
When it comes to databases, it’s almost too easy to overlook the existence of Redis. That’s because Redis is an in-memory database and is mostly used in support functions like caching.
Learning this database is a ten-minute job (literally!), and it’s a simple key-value store that stores strings with an expiry time (which can be set to infinity, of course). What Redis loses in features it makes up for in utility and performance. Since it lives entirely in RAM, reads and writes are insanely fast (a few hundred thousand operations per second aren’t unheard of).
Redis also has a sophisticated pub-sub system, which makes this “database” twice as attractive.
In other words, if you have a project that could benefit from caching or has some distributed components, Redis is the first choice.
Yes, I promised that we were done with relational databases, but SQLite is too cute to ignore.
SQLite is a lightweight C library that provided a relational database storage engine. Everything in this database lives in a single file (with a .sqlite extension) that you can put anywhere in your filesystem. And that’s all you need to use it! Yes, no “server” software to install and no service to connect to.
Even though SQLite is a lightweight alternative to a database like MySQL, it packs quite a punch. Some of its shocking features are:
- Full support for transactions, with COMMIT, ROLLBACK, and BEGIN.
- Support for 32,000 columns per table
- JSON support
- 64-way JOIN support
- Subqueries, full-text search, etc.
- Maximum database size of 140 terabytes!
- Maximum row size of 1 gigabyte!
- 35% faster than file I/O
When to use SQLite
SQLite is an extremely specialized database that focuses on a no-nonsense, get-shit-done approach. If your app is relatively simple and you don’t want the hassle of a full-blown database, SQLite is a serious candidate. It makes particular sense for small- to mid-sized CMSs and demo applications.
When not using SQLite
While impressive, SQLite doesn’t cover all the features of standard SQL or your favorite database engine. Clustering, stored procedures, and scripting extensions are missing. Also, there’s no client to connect, query, and explore the database. Finally, as the application size grows, performance will degrade.
While many proclaim that the end is near for Java, every once in a while, the community drops a bombshell and silences the critics. Cassandra is one such example.
Cassandra belongs to what’s known as the “columnar” family of databases. The storage abstraction in Cassandra is a column rather than a row. The idea here is to store all the data in a column physically together on the disk, minimizing seek time.
Cassandra was designed with a specific use case in mind — dealing with write-heavy loads and zero tolerance for downtime. These become its unique selling points.
- Extremely fast write performance. Cassandra is arguably the fastest database out there when it comes to handling heavy write loads.
- Linear scalability. That is, you can keep adding as many nodes to a cluster as you want, and there will be a zero increase in complexity or brittleness of the cluster.
- Unmatched partition tolerance. That is, even if multiple nodes in a Cassandra cluster go down, the database is designed to keep performing without loss of integrity.
- Static typing
When to use Cassandra
Logging and analytics are two of the best use cases for Cassandra. But that’s not all — the sweet spot is when you need to handle really large sizes of data (Apple has a Cassandra deployment handling 400+ petabytes of data while at Netflix, it handles 1 trillion requests a day) with literally zero downtime. High availability is one of the hallmarks of Cassandra.
When not to use Cassandra
The column storage scheme of Cassandra also has its disadvantages. The data model is rather flat, and if you need aggregations, then Cassandra falls short. Moreover, it achieves high availability by sacrificing consistency (remember the CAP theorem for distributed systems), which makes it less suitable for systems where high read accuracy is needed.
New developments demand new types of databases, and the Internet of Things (IoT) is one such phenomenon. One of the best open source databases for that is Timescale.
The timescale is a type of what’s called a “time series” database. It’s different from a traditional database in that time is the primary axis of concern, and the analytics and visualization of massive data sets is a top priority. Time series databases rarely see a change in existing data; an example is temperature readings sent by a sensor in a greenhouse — new data keeps getting accumulated every second, which is of interest for analytics and reporting.
Why not only use a traditional database with a timestamp field, then? Well, there are two main reasons for that:
- General-purpose databases are not optimized to work with time-based data. For the same amounts of data, a general-purpose database will be much slower.
- The database needs to handle massive amounts of data as new data keeps flowing in, and removing data or changing schema, later on is not an option.
Timescale DB has some exciting features that set it apart from other databases in the same category:
- It’s built on PostgreSQL, arguably the best open-source relational database out there. If your project is already running PostgreSQL, Timescale will slide right in.
- Querying is done through the familiar SQL syntax, reducing the learning curve.
- Ridiculously fast write speeds — millions of inserts per second aren’t unheard of.
- Billions of rows or petabytes of data — it’s no big deal for Timescale.
- True flexibility with schema — choose from relational or schemaless as per your needs.
It doesn’t make much sense to talk about when to use or not use Timescale DB. If IoT is your domain, or you’re after similar database characteristics, Timescale is worth a look.
CouchDB is a neat little database solution that sits quietly in a corner and has a small but dedicated following. It was created to deal with the problems of a network loss and eventual resolution of data, which happens to be a problem so messy that developers would instead switch jobs than deal with it.
Essentially, you can think of a CouchDB cluster as a distributed collection of nodes large and small, some of which are expected to be offline. As soon as a node comes online, it sends data back to the cluster, which is slowly and carefully digested, eventually becoming available to the entire cluster.
CouchDB is something of a unique breed when it comes to databases.
- Offline-first data syncing capabilities
- Specialized versions for mobile and web browsers (PouchDB, CouchDB Lite, etc.)
- Crash-resistant, battle-tested reliability
- Easy clustering with redundant data storage
When to use CouchDB
CouchDB was built for offline tolerance and remains unmatched in this regard. A typical use case is mobile apps where a portion of your data resides on a CouchDB instance on the user’s phone (because that is where it was generated). The exciting thing is that you cannot rely on the user’s device to be connected all the time, which means the database has to be opportunistic and be ready to resolve conflicting updates later on. This is achieved using the impressive Couch Replication Protocol.
When not to use CouchDB
Trying to use CouchDB outside of its intended use case will lead to disaster. It uses way much more storage than anything else out there, simply because it needs to maintain redundant copies of data and conflict resolution results. As a result, write speeds are also painfully slow. Finally, CouchDB is not suitable as a general-purpose schema engine, as it doesn’t play well with schema changes.
I had to leave out many interesting candidates like Riak, so this list is to be taken as a guide rather than a commandment. I hope I was able to achieve my goal with this article — present not just a collection of database software recommendations but also briefly discuss where and how they need to be used (and avoided!).
If you are curious to learn databases, check out Udemy for some brilliant online courses.