Query performance comparison for several SQL engines

by 司马顿 | April 7, 2022, 6:44 PM

In this article I run a simple query-performance test against several SQL engines. The test environment is as follows.

Hive version:

$ hive --version
Hive 3.1.2

Hadoop version:

$ hadoop version
Hadoop 3.3.2

Java version:

$ java -version
java version "1.8.0_321"
Java(TM) SE Runtime Environment (build 1.8.0_321-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.321-b07, mixed mode)

MySQL info:

mysql> select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.7.36-0ubuntu0.18.04.1 |
+-------------------------+
1 row in set (0.00 sec)

mysql> show variables like 'key_buffer_size';
+-----------------+-----------+
| Variable_name   | Value     |
+-----------------+-----------+
| key_buffer_size | 268435456 |
+-----------------+-----------+
1 row in set (0.00 sec)

Spark version:

version 3.2.0

Drill version:

apache drill> select version from sys.version;
+---------+
| version |
+---------+
| 1.20.0  |
+---------+

OS info:

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"

Hardware specs:

CPU: dual AMD EPYC 7302
RAM: 4GB dedicated
DISK: 40GB NVME
Virtualization: KVM

Dataset size and a sample row:

mysql> select count(*) from ppl;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (0.38 sec)

mysql> select * from ppl limit 1;
+-------------+--------+------------+-------+-----------------+------------------------------+--------+
| name        | sex    | born       | zip   | email           | job                          | salary |
+-------------+--------+------------+-------+-----------------+------------------------------+--------+
| Charli A.Q. | Female | 1988-01-24 | 75347 | [email protected] | Child & Family Social Worker |  14681 |
+-------------+--------+------------+-------+-----------------+------------------------------+--------+
1 row in set (0.00 sec)
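
The article doesn't include the data generator, but a minimal Python sketch of how a CSV with this schema could be produced might look like the following. All column values here (names, jobs, the email pattern, the salary range) are placeholders, not the actual generator behind the benchmark data:

```python
# A minimal, hypothetical sketch of generating a CSV with the ppl schema.
import csv
import random
from datetime import date, timedelta

JOBS = ["IT Manager", "Cashier", "Plumber", "Lawyer", "Painter"]
NAMES = ["Charli A.Q.", "Alex B.", "Sam C.", "Pat D."]

rng = random.Random(42)

with open("100w.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "sex", "born", "zip", "email", "job", "salary"])
    for i in range(1000):  # the real table has 1,000,000 rows
        born = date(1960, 1, 1) + timedelta(days=rng.randrange(40 * 365))
        writer.writerow([
            rng.choice(NAMES),
            rng.choice(["Female", "Male"]),
            born.isoformat(),
            f"{rng.randrange(100000):05d}",
            f"user{i}@example.com",       # placeholder address
            rng.choice(JOBS),
            rng.randrange(10000, 20000),  # salary range inferred from the sample
        ])
```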

The Hive implementation:

hive> select job,avg(salary) as avg_salary from ppl group by job order by avg_salary desc limit 20;

IT Manager	17682.772872552938
Computer Systems Administrator	17663.631755044105
Cashier	17620.562191510366
Plumber	17612.91147476341
Veterinary Technologist & Technician	17608.890145072535
Fabricator	17604.09839886502
Computer Programmer	17591.551812450747
Middle School Teacher	17590.884146341465
Physical Therapist	17590.864870299618
Bus Driver	17581.65017168249
Security Guard	17581.07212652362
Financial Advisor	17574.451710952053
Mental Health Counselor	17573.052391105695
Pharmacy Technician	17572.549974937345
Sales Manager	17565.845252284467
Computer Systems Analyst	17562.52641202763
Maintenance & Repair Worker	17555.82058970515
Painter	17553.907816229115
Administrative Assistant	17551.82823267126
Lawyer	17549.650365768113

Time taken: 5.488 seconds, Fetched: 20 row(s)
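
The same group-and-aggregate query is reused for every engine below. As a toy illustration of its semantics in plain Python (over a tiny in-memory table, not the benchmark data):

```python
# Toy illustration of the semantics of:
#   SELECT job, AVG(salary) AS avg_salary FROM ppl
#   GROUP BY job ORDER BY avg_salary DESC LIMIT 20;
from collections import defaultdict

rows = [
    {"job": "IT Manager", "salary": 18000},
    {"job": "IT Manager", "salary": 17000},
    {"job": "Cashier", "salary": 16000},
    {"job": "Plumber", "salary": 15000},
]

# GROUP BY job, accumulating (sum, count) per group
groups = defaultdict(lambda: [0, 0])
for r in rows:
    acc = groups[r["job"]]
    acc[0] += r["salary"]
    acc[1] += 1

# AVG(salary), then ORDER BY avg_salary DESC LIMIT 20
avgs = [(job, s / n) for job, (s, n) in groups.items()]
top = sorted(avgs, key=lambda t: t[1], reverse=True)[:20]
print(top)  # [('IT Manager', 17500.0), ('Cashier', 16000.0), ('Plumber', 15000.0)]
```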

The Spark implementation (which takes 2.2s):

scala> df.groupBy("job").agg(avg("salary").alias("avg_salary")).orderBy(desc("avg_salary")).show(20,false)
+------------------------------------+------------------+                       
|job                                 |avg_salary        |
+------------------------------------+------------------+
|IT Manager                          |17682.772872552938|
|Computer Systems Administrator      |17663.631755044105|
|Cashier                             |17620.562191510366|
|Plumber                             |17612.91147476341 |
|Veterinary Technologist & Technician|17608.890145072535|
|Fabricator                          |17604.09839886502 |
|Computer Programmer                 |17591.551812450747|
|Middle School Teacher               |17590.884146341465|
|Physical Therapist                  |17590.864870299618|
|Bus Driver                          |17581.65017168249 |
|Security Guard                      |17581.07212652362 |
|Financial Advisor                   |17574.451710952053|
|Mental Health Counselor             |17573.052391105695|
|Pharmacy Technician                 |17572.549974937345|
|Sales Manager                       |17565.845252284467|
|Computer Systems Analyst            |17562.52641202763 |
|Maintenance & Repair Worker         |17555.82058970515 |
|Painter                             |17553.907816229115|
|Administrative Assistant            |17551.82823267126 |
|Lawyer                              |17549.650365768113|
+------------------------------------+------------------+
only showing top 20 rows

The Drill implementation:

> select job,avg(cast(salary as float)) as avg_salary from `100w.csv` group by job order by avg_salary desc limit 20;
+--------------------------------------+--------------------+
|                 job                  |     avg_salary     |
+--------------------------------------+--------------------+
| IT Manager                           | 17682.772872552938 |
| Computer Systems Administrator       | 17663.631755044105 |
| Cashier                              | 17620.562191510366 |
| Plumber                              | 17612.91147476341  |
| Veterinary Technologist & Technician | 17608.890145072535 |
| Fabricator                           | 17604.09839886502  |
| Computer Programmer                  | 17591.551812450747 |
| Middle School Teacher                | 17590.884146341465 |
| Physical Therapist                   | 17590.864870299618 |
| Bus Driver                           | 17581.65017168249  |
| Security Guard                       | 17581.07212652362  |
| Financial Advisor                    | 17574.451710952053 |
| Mental Health Counselor              | 17573.052391105695 |
| Pharmacy Technician                  | 17572.549974937345 |
| Sales Manager                        | 17565.845252284467 |
| Computer Systems Analyst             | 17562.52641202763  |
| Maintenance & Repair Worker          | 17555.82058970515  |
| Painter                              | 17553.907816229115 |
| Administrative Assistant             | 17551.82823267126  |
| Lawyer                               | 17549.650365768113 |
+--------------------------------------+--------------------+
20 rows selected (1.237 seconds)

The MySQL implementation (with an index on the job column):

mysql> select job,avg(salary) as avg_salary from ppl group by job order by avg_salary desc limit 20;
+--------------------------------------+--------------------+
| job                                  | avg_salary         |
+--------------------------------------+--------------------+
| IT Manager                           | 17682.772872552938 |
| Computer Systems Administrator       | 17663.631755044105 |
| Cashier                              | 17620.562191510366 |
| Plumber                              |  17612.91147476341 |
| Veterinary Technologist & Technician | 17608.890145072535 |
| Fabricator                           |  17604.09839886502 |
| Computer Programmer                  | 17591.551812450747 |
| Middle School Teacher                | 17590.884146341465 |
| Physical Therapist                   | 17590.864870299618 |
| Bus Driver                           |  17581.65017168249 |
| Security Guard                       |  17581.07212652362 |
| Financial Advisor                    | 17574.451710952053 |
| Mental Health Counselor              | 17573.052391105695 |
| Pharmacy Technician                  | 17572.549974937345 |
| Sales Manager                        | 17565.845252284467 |
| Computer Systems Analyst             |  17562.52641202763 |
| Maintenance & Repair Worker          |  17555.82058970515 |
| Painter                              | 17553.907816229115 |
| Administrative Assistant             |  17551.82823267126 |
| Lawyer                               | 17549.650365768113 |
+--------------------------------------+--------------------+
20 rows in set (3.04 sec)

The final comparison:

+------------+------+-------+-------+-------+
|            | Hive | Spark | Drill | MySQL |
+------------+------+-------+-------+-------+
| Query Time | 5.5s | 2.2s  | 1.2s  | 3.0s  |
+------------+------+-------+-------+-------+

As you can see, at this dataset scale (1 million rows), for a simple group-and-aggregate SQL query, Drill is still the fastest. Hive is the slowest, which is expected since it runs on the old MapReduce execution engine; I didn't test Hive with the Tez engine. Spark comes second, which is reasonable, since Spark and Drill have broadly similar execution architectures. MySQL uses the InnoDB storage engine on a default installation; the only tuning I applied was increasing key_buffer_size to 256MB.

All the queries were run many times; I take the average values as the final result.
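
As a rough sketch of that methodology, per-engine timing could be automated as below. The run_query function is a placeholder for invoking the engine under test (via its CLI or driver); the actual numbers above came from each engine's own reported query times:

```python
# Hypothetical timing harness: run a query several times, report the mean.
import time
import statistics

def run_query():
    # Placeholder for the real engine call; here it just simulates work.
    time.sleep(0.01)

def average_runtime(runs=5):
    """Run the query `runs` times and return the mean wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

print(f"avg over 5 runs: {average_runtime():.3f}s")
```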

Source URL: https://smart.postno.de/archives/3765