In this article I run a simple query-performance test against several SQL engines. The test environment is as follows.
Hive version:
$ hive --version
Hive 3.1.2
Hadoop version:
$ hadoop version
Hadoop 3.3.2
Java version:
$ java -version
java version "1.8.0_321"
Java(TM) SE Runtime Environment (build 1.8.0_321-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.321-b07, mixed mode)
MySQL info:
mysql> select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.7.36-0ubuntu0.18.04.1 |
+-------------------------+
1 row in set (0.00 sec)
mysql> show variables like 'key_buffer_size';
+-----------------+-----------+
| Variable_name   | Value     |
+-----------------+-----------+
| key_buffer_size | 268435456 |
+-----------------+-----------+
1 row in set (0.00 sec)
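That key_buffer_size of 268435456 bytes (256MB) is the only tuning mentioned later in this article. As a sketch, a setting like this can be applied either at runtime or persistently; the exact way my instance was configured is not shown here:

-- at runtime (lost on server restart)
SET GLOBAL key_buffer_size = 268435456;
-- or persistently, in my.cnf under the [mysqld] section:
--   key_buffer_size = 256M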
Spark version:
version 3.2.0
Drill version:
apache drill> select version from sys.version;
+---------+
| version |
+---------+
| 1.20.0 |
+---------+
OS info:
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
Hardware specs:
CPU: dual AMD EPYC 7302
RAM: 4GB dedicated
DISK: 40GB NVMe
Virtualization: KVM
Dataset size and a sample row:
mysql> select count(*) from ppl;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.38 sec)
mysql> select * from ppl limit 1;
+-------------+--------+------------+-------+-----------------+------------------------------+--------+
| name        | sex    | born       | zip   | email           | job                          | salary |
+-------------+--------+------------+-------+-----------------+------------------------------+--------+
| Charli A.Q. | Female | 1988-01-24 | 75347 | [email protected] | Child & Family Social Worker | 14681 |
+-------------+--------+------------+-------+-----------------+------------------------------+--------+
1 row in set (0.00 sec)
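For reference, the ppl table could have been created with a schema roughly like the one below; the column types and lengths are my own guesses based on the sample row, not the exact DDL used:

-- hypothetical schema inferred from the sample row above
CREATE TABLE ppl (
  name   VARCHAR(100),
  sex    VARCHAR(10),
  born   DATE,
  zip    VARCHAR(10),
  email  VARCHAR(100),
  job    VARCHAR(100),
  salary INT
);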
The Hive implementation:
hive> select job,avg(salary) as avg_salary from ppl group by job order by avg_salary desc limit 20;
IT Manager 17682.772872552938
Computer Systems Administrator 17663.631755044105
Cashier 17620.562191510366
Plumber 17612.91147476341
Veterinary Technologist & Technician 17608.890145072535
Fabricator 17604.09839886502
Computer Programmer 17591.551812450747
Middle School Teacher 17590.884146341465
Physical Therapist 17590.864870299618
Bus Driver 17581.65017168249
Security Guard 17581.07212652362
Financial Advisor 17574.451710952053
Mental Health Counselor 17573.052391105695
Pharmacy Technician 17572.549974937345
Sales Manager 17565.845252284467
Computer Systems Analyst 17562.52641202763
Maintenance & Repair Worker 17555.82058970515
Painter 17553.907816229115
Administrative Assistant 17551.82823267126
Lawyer 17549.650365768113
Time taken: 5.488 seconds, Fetched: 20 row(s)
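For reproducibility, here is a sketch of how the same data could be exposed to Hive as an external table over a CSV file; the location path, header handling, and column types are assumptions on my part rather than the exact DDL behind the query above:

-- hypothetical external table over the CSV data; adjust LOCATION to your HDFS path
CREATE EXTERNAL TABLE ppl (
  name   STRING,
  sex    STRING,
  born   DATE,
  zip    STRING,
  email  STRING,
  job    STRING,
  salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/ppl'
TBLPROPERTIES ('skip.header.line.count'='1');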
The Spark implementation (which takes 2.2s):
scala> df.groupBy("job").agg(avg("salary").alias("avg_salary")).orderBy(desc("avg_salary")).show(20,false)
+------------------------------------+------------------+
|job |avg_salary |
+------------------------------------+------------------+
|IT Manager |17682.772872552938|
|Computer Systems Administrator |17663.631755044105|
|Cashier |17620.562191510366|
|Plumber |17612.91147476341 |
|Veterinary Technologist & Technician|17608.890145072535|
|Fabricator |17604.09839886502 |
|Computer Programmer |17591.551812450747|
|Middle School Teacher |17590.884146341465|
|Physical Therapist |17590.864870299618|
|Bus Driver |17581.65017168249 |
|Security Guard |17581.07212652362 |
|Financial Advisor |17574.451710952053|
|Mental Health Counselor |17573.052391105695|
|Pharmacy Technician |17572.549974937345|
|Sales Manager |17565.845252284467|
|Computer Systems Analyst |17562.52641202763 |
|Maintenance & Repair Worker |17555.82058970515 |
|Painter |17553.907816229115|
|Administrative Assistant |17551.82823267126 |
|Lawyer |17549.650365768113|
+------------------------------------+------------------+
only showing top 20 rows
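The DataFrame chain above is just another way of writing the same SQL the other engines run. Assuming the CSV were registered as a temporary view named ppl (a name chosen here only for illustration), the equivalent Spark SQL would be:

-- equivalent Spark SQL, assuming a view/table named ppl
SELECT job, AVG(salary) AS avg_salary
FROM ppl
GROUP BY job
ORDER BY avg_salary DESC
LIMIT 20;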
The Drill implementation:
> select job,avg(cast(salary as float)) as avg_salary from `100w.csv` group by job order by avg_salary desc limit 20;
+--------------------------------------+--------------------+
| job | avg_salary |
+--------------------------------------+--------------------+
| IT Manager | 17682.772872552938 |
| Computer Systems Administrator | 17663.631755044105 |
| Cashier | 17620.562191510366 |
| Plumber | 17612.91147476341 |
| Veterinary Technologist & Technician | 17608.890145072535 |
| Fabricator | 17604.09839886502 |
| Computer Programmer | 17591.551812450747 |
| Middle School Teacher | 17590.884146341465 |
| Physical Therapist | 17590.864870299618 |
| Bus Driver | 17581.65017168249 |
| Security Guard | 17581.07212652362 |
| Financial Advisor | 17574.451710952053 |
| Mental Health Counselor | 17573.052391105695 |
| Pharmacy Technician | 17572.549974937345 |
| Sales Manager | 17565.845252284467 |
| Computer Systems Analyst | 17562.52641202763 |
| Maintenance & Repair Worker | 17555.82058970515 |
| Painter | 17553.907816229115 |
| Administrative Assistant | 17551.82823267126 |
| Lawyer | 17549.650365768113 |
+--------------------------------------+--------------------+
20 rows selected (1.237 seconds)
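Note that the Drill query references CSV columns by name, which only works when header extraction is enabled for the text format. To my understanding, Drill can also pass those format options per query through its table function, roughly as below; treat the path and option names as a sketch based on the Drill documentation rather than the exact statement I ran:

-- sketch: per-query text-format options via Drill's table function
SELECT job, AVG(CAST(salary AS FLOAT)) AS avg_salary
FROM table(dfs.`100w.csv`(type => 'text', fieldDelimiter => ',', extractHeader => true))
GROUP BY job
ORDER BY avg_salary DESC
LIMIT 20;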
The MySQL implementation (the job column is indexed; a sketch of the index DDL follows the result below):
mysql> select job,avg(salary) as avg_salary from ppl group by job order by avg_salary desc limit 20;
+--------------------------------------+--------------------+
| job | avg_salary |
+--------------------------------------+--------------------+
| IT Manager | 17682.772872552938 |
| Computer Systems Administrator | 17663.631755044105 |
| Cashier | 17620.562191510366 |
| Plumber | 17612.91147476341 |
| Veterinary Technologist & Technician | 17608.890145072535 |
| Fabricator | 17604.09839886502 |
| Computer Programmer | 17591.551812450747 |
| Middle School Teacher | 17590.884146341465 |
| Physical Therapist | 17590.864870299618 |
| Bus Driver | 17581.65017168249 |
| Security Guard | 17581.07212652362 |
| Financial Advisor | 17574.451710952053 |
| Mental Health Counselor | 17573.052391105695 |
| Pharmacy Technician | 17572.549974937345 |
| Sales Manager | 17565.845252284467 |
| Computer Systems Analyst | 17562.52641202763 |
| Maintenance & Repair Worker | 17555.82058970515 |
| Painter | 17553.907816229115 |
| Administrative Assistant | 17551.82823267126 |
| Lawyer | 17549.650365768113 |
+--------------------------------------+--------------------+
20 rows in set (3.04 sec)
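The index on the job column mentioned above could have been added along these lines; the index name is a placeholder of my own:

-- hypothetical index name; any name works
ALTER TABLE ppl ADD INDEX idx_job (job);
-- or equivalently:
-- CREATE INDEX idx_job ON ppl (job);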
The final comparison:
|            | Hive | Spark | Drill | MySQL |
| Query Time | 5.5s | 2.2s  | 1.2s  | 3.0s  |
As you can see, at this scale (1 million rows) and for a simple grouping-and-aggregation query, Drill is still the fastest. Hive is the slowest, which is expected since it runs on the old MapReduce execution engine; I didn't test Hive with the Tez engine. Spark is the second fastest, which seems reasonable, since Spark and Drill have broadly similar implementation architectures. MySQL was a default installation using the InnoDB engine; the only tuning I applied was increasing key_buffer_size to 256MB.
All the queries were run many times, and I took the average values as the final results.