Battle of the RDBMS Giants in AWS

MySQL vs PostgreSQL is one of the never-ending wars in the RDBMS world. This article will help you choose the best fit for your use case.

Vithulan MV
Sysco LABS Sri Lanka

--

Source — silota

Background

Recently, during a project I was working on, we were given a requirement:

— “To retrieve records from the DB within 10 milliseconds consistently, from two large data sources/tables.”

One data source contained approximately 686 million records and the other contained 24 million records.

On top of this, the requirement expressed a preference for an AWS-managed DBMS, so that the database instances could be easily managed in the long run.

Thus, with these requirements, the hunt for the best-suited DBMS began. 🤞🏽

Why Amazon Aurora?

With the data volume this large and the read rate high (about 600 SELECTs/sec at peak), a highly scalable, distributed DBMS solution became necessary.

According to the CAP theorem, a distributed system can guarantee at most two of three properties: Consistency, Availability and Partition Tolerance.

However, network partitions are unavoidable in distributed systems, so partition tolerance is effectively mandatory; the real trade-off is between consistency and availability.

CAP theorem

In our use case, we needed to meet these two criteria:

  1. The system should always be available.
  2. The data should be distributed across nodes for high availability and durability.

This makes our system an AP system (Availability and Partition tolerance).

AWS provides RDS and Aurora when it comes to RDBMS solutions. According to official AWS documentation, Amazon Aurora provides 99.99% availability, replicates six copies of the data across three Availability Zones, and continuously backs the data up to Amazon S3. The data may be in an inconsistent state during replication or when communication between nodes fails, but Aurora converges back to a consistent state within milliseconds, achieving near real-time consistency.

They also claim that Aurora MySQL is five times faster than RDS MySQL, and that Aurora PostgreSQL is three times faster than RDS PostgreSQL.

Therefore, given these facts, it was a no-brainer to choose Amazon Aurora.

The next step was to decide which Aurora engine to use. Given the importance of this decision, we conducted a few experiments with Aurora MySQL and Aurora PostgreSQL to determine which would be the best fit for our use case.

Optimizations

We found the best solution with both MySQL and PostgreSQL after sharding at the database level, creating separate databases to store similar data coming from the same source, and after applying multiple optimizations to the query and tables by experimenting with different indices.

Sharding is handled by an ETL (Extract, Transform, Load) process, which decides into which database the incoming data should be loaded; at read time, the system finds the relevant database to query based on the user request.
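
As a rough illustration of what this database-level sharding can look like, here is a minimal MySQL-style sketch. The shard names, the table and the routing rule are hypothetical, and the real ETL logic is not shown; each shard is simply a separate database holding the same table definitions.

-- Hypothetical sketch of database-level sharding: one database per shard.
CREATE DATABASE IF NOT EXISTS shard_01;
CREATE DATABASE IF NOT EXISTS shard_02;

-- The same (illustrative) table is created in every shard.
CREATE TABLE IF NOT EXISTS shard_01.lookup_data (
    record_key VARCHAR(14) NOT NULL,
    payload    BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (record_key)
);

-- ETL write path: rows arriving from "source 01" are always loaded into shard_01.
INSERT INTO shard_01.lookup_data (record_key, payload)
VALUES ('KEY-0000000001', 42);

-- Read path: the application resolves the shard from the user request and
-- queries only that database.
SELECT payload
FROM shard_01.lookup_data
WHERE record_key = 'KEY-0000000001';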

The MySQL table schema and the query are shown below (with attribute and table names masked) to give you visibility before proceeding to the performance test results. Similar schemas and queries were developed and used in PostgreSQL with default settings.

MySQL Table Schema:

CREATE TABLE IF NOT EXISTS `table_1` (
    `att1` VARCHAR(9) NOT NULL,
    `att2` TINYINT(1) UNSIGNED NOT NULL,
    `att3` VARCHAR(14) NOT NULL,
    `att4` DATETIME NOT NULL,
    `att5` BIGINT(10) UNSIGNED NOT NULL,
    `att6` BIGINT(10) UNSIGNED NOT NULL,
    PRIMARY KEY (`att3`, `att1`)
) ENGINE=INNODB DEFAULT CHARSET=UTF8;

CREATE TABLE IF NOT EXISTS `table_2` (
    `att1` VARCHAR(9) NOT NULL,
    `att2` TINYINT(1) UNSIGNED NOT NULL,
    `att3` DOUBLE(13, 3) NOT NULL,
    `att4` DATETIME NOT NULL,
    `att5` BIGINT(10) UNSIGNED NOT NULL,
    `att6` CHAR(1) NOT NULL,
    `att7` BIGINT(10) UNSIGNED NOT NULL,
    `att8` BIGINT(10) UNSIGNED NOT NULL,
    PRIMARY KEY (`att1`, `att2`, `att4`, `att5`)
) ENGINE=INNODB DEFAULT CHARSET=UTF8;
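
The exact PostgreSQL DDL was not published, but a possible translation of the masked tables above (the type mapping is an assumption) looks like this:

-- Assumed type mapping: TINYINT(1) UNSIGNED -> SMALLINT, DATETIME -> TIMESTAMP,
-- DOUBLE(13,3) -> NUMERIC(13,3), BIGINT(10) UNSIGNED -> BIGINT.
CREATE TABLE IF NOT EXISTS table_1 (
    att1 VARCHAR(9)     NOT NULL,
    att2 SMALLINT       NOT NULL,
    att3 VARCHAR(14)    NOT NULL,
    att4 TIMESTAMP      NOT NULL,
    att5 BIGINT         NOT NULL,
    att6 BIGINT         NOT NULL,
    PRIMARY KEY (att3, att1)
);

CREATE TABLE IF NOT EXISTS table_2 (
    att1 VARCHAR(9)     NOT NULL,
    att2 SMALLINT       NOT NULL,
    att3 NUMERIC(13, 3) NOT NULL,
    att4 TIMESTAMP      NOT NULL,
    att5 BIGINT         NOT NULL,
    att6 CHAR(1)        NOT NULL,
    att7 BIGINT         NOT NULL,
    att8 BIGINT         NOT NULL,
    PRIMARY KEY (att1, att2, att4, att5)
);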

The MySQL query that was used (attributes and table names are masked):

SELECT
    table_2_Outer.att6,
    table_2_Outer.att1,
    table_2_Outer.att2,
    table_2_Outer.att3,
    table_2_Outer.att4,
    table_2_Outer.att5
FROM
    table_2 table_2_Outer FORCE INDEX (`PRIMARY`)
    INNER JOIN
    (
        -- For each (att6, att1) pair coming from table_1, take the maximum
        -- att3 value in table_2 that does not exceed the given parameter.
        SELECT
            MAX(table_2_Inner.att3) max_att3,
            table_2_Inner.att6,
            table_2_Inner.att1
        FROM
            (
                -- Filter table_1 down to the requested records first.
                SELECT
                    e.att6,
                    e.att1,
                    e.att7
                FROM
                    table_1 e FORCE INDEX (`PRIMARY`)
                WHERE
                    e.att7 = ?
                    AND e.att6 IN (?)
            ) pz
            INNER JOIN
            table_2 table_2_Inner FORCE INDEX (`PRIMARY`)
                ON pz.att6 = table_2_Inner.att6
                AND pz.att1 = table_2_Inner.att1
                AND table_2_Inner.att3 <= ?
        GROUP BY
            table_2_Inner.att6,
            table_2_Inner.att1
    ) c
    ON c.att6 = table_2_Outer.att6
    AND c.att1 = table_2_Outer.att1
    AND c.max_att3 = table_2_Outer.att3;
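
For the PostgreSQL variant of this query, one syntactic difference is worth noting: PostgreSQL has no FORCE INDEX (or any other index-hint) syntax, so the hints are simply dropped and the planner is left to choose the indexes. A minimal sketch against table_1, with placeholder literal values:

-- MySQL / Aurora MySQL: the primary-key index is pinned explicitly.
SELECT att5, att6
FROM table_1 FORCE INDEX (`PRIMARY`)
WHERE att3 = 'KEY-0000000001' AND att1 = 'A00000001';

-- PostgreSQL / Aurora PostgreSQL: no hint syntax; the planner is expected to
-- use the primary-key index on (att3, att1) on its own.
SELECT att5, att6
FROM table_1
WHERE att3 = 'KEY-0000000001' AND att1 = 'A00000001';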

Aurora MySQL vs Aurora PostgreSQL

We created an Aurora MySQL cluster and an Aurora PostgreSQL cluster with similar specs. Each cluster was loaded with the same data, and then we carried out the performance tests.

Fig.1 Cluster specs that were used in the experiment
Fig.2 db.r5.xlarge specs

Read Operation Comparison

Since the requirement stated that the application would perform far more reads than writes, we put more weight on read-operation statistics. The same data and a similar query with the same index were used in both databases to measure query latency. You can see the results in the figure below.

Fig.3 Read operation results

The latencies we observed were heavily influenced by the JOIN operation and MAX aggregation used in the test query. Aurora PostgreSQL's JOIN and MAX aggregation perform better when the data size is large, and their latency stays consistent across both smaller and larger tables. Aurora MySQL's JOIN and MAX aggregation, however, outperform Aurora PostgreSQL at smaller table sizes.
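
To see how each engine executes a JOIN plus MAX aggregation of this shape, the query plans can be compared. Below is a simplified stand-in for the masked test query, with placeholder literal values; it is only a sketch of how such a comparison can be made.

-- MySQL / Aurora MySQL: show the join order and the chosen indexes.
EXPLAIN
SELECT t2.att1, t2.att2, MAX(t2.att3) AS max_att3
FROM table_2 t2
INNER JOIN table_1 t1 ON t1.att1 = t2.att1
WHERE t2.att4 <= '2020-01-01 00:00:00'
GROUP BY t2.att1, t2.att2;

-- PostgreSQL / Aurora PostgreSQL: execute the query and report per-node
-- timings and buffer usage, making the JOIN and aggregation cost visible.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t2.att1, t2.att2, MAX(t2.att3) AS max_att3
FROM table_2 t2
INNER JOIN table_1 t1 ON t1.att1 = t2.att1
WHERE t2.att4 <= '2020-01-01 00:00:00'
GROUP BY t2.att1, t2.att2;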

Write & Index operation Comparison

The two data sources were loaded into MySQL and PostgreSQL separately, but in a similar manner. The time taken to complete each load, along with the disk and CPU usage during the loads, was measured and is shown in Fig. 4 and Fig. 5. Indexes were created once each data load had completed, and the time taken to build them and the CPU usage during that period were also measured (included in Fig. 5).

Fig.4 First data source load results
Fig.5 Second data source load results
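
The actual secondary indexes are not disclosed here, but the indexing step described above (building indexes only after a bulk load completes) could look roughly like the following; the index name and columns are hypothetical, chosen to match the join in the test query.

-- MySQL / Aurora MySQL: hypothetical secondary index added after the load.
ALTER TABLE table_2 ADD INDEX idx_table_2_att6_att1 (att6, att1);

-- PostgreSQL / Aurora PostgreSQL: equivalent index, also built after the load.
CREATE INDEX idx_table_2_att6_att1 ON table_2 (att6, att1);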

Verdict

From the findings of these experiments, the following verdicts can be drawn.

  1. Under default settings, PostgreSQL generally performs better at reading large amounts of data, while MySQL performs better when there is a comparatively smaller number of records (below 100 million records per table in these experiments).
  2. Latencies observed in this experiment were largely impacted by the JOIN operation and MAX aggregation.

Here is a checklist to help you choose between MySQL and PostgreSQL based on your use case.

Fig.6 Decision checklist

Acknowledgement

This experiment and blog post are collaborative work with my colleague Chamara Madhushan Liyanage.

Stay Curious 🤓.
