by 司马顿 | 2022年4月4日 上午11:47
上篇博客讲了Drill 1.20配置Hive3存储插件,过程还是有点折腾的。
配置完后,我顺便配了下Mysql存储插件,这个就很简单了。
把mysql连接器程序放到drill的classpath下,然后启动drill-embedded,打开web UI,在默认的mysql插件那里,修改如下几个值:
"url": "jdbc:mysql://localhost:3306",
"username": "spark",
"password": "spark123",
也就是访问Mysql的相关凭证。改完后再在web UI里激活这个插件即可。
我们回到drill shell,可以如下访问Mysql的内容:
apache drill (mysql.spark)> use mysql.spark;
+------+-----------------------------------------+
| ok | summary |
+------+-----------------------------------------+
| true | Default schema changed to [mysql.spark] |
+------+-----------------------------------------+
1 row selected (0.1 seconds)
apache drill (mysql.spark)> show tables;
+--------------+---------------------------+
| TABLE_SCHEMA | TABLE_NAME |
+--------------+---------------------------+
| mysql.spark | AUX_TABLE |
| mysql.spark | BUCKETING_COLS |
这是一个简单的select:
apache drill (mysql.spark)> select name,sum(number) as dd from fruit group by name order by dd desc limit 10;
+-----------+-----+
| name | dd |
+-----------+-----+
| blueberry | 166 |
| cumquat | 145 |
| lemon | 145 |
| areca nut | 139 |
| haw | 137 |
| greengage | 134 |
| banana | 134 |
| raspberry | 133 |
| longan | 132 |
| nectarine | 128 |
+-----------+-----+
10 rows selected (0.155 seconds)
这看起来怪怪的,在drill里去对Mysql执行SQL查询。殊不知Mysql才是这个领域的王者,它的SQL性能无与伦比。同样对上述查询,Mysql的时间永远是0秒。
mysql> select name,sum(number) as dd from fruit group by name order by dd desc limit 10;
+-----------+------+
| name | dd |
+-----------+------+
| blueberry | 166 |
| cumquat | 145 |
| lemon | 145 |
| areca nut | 139 |
| haw | 137 |
| banana | 134 |
| greengage | 134 |
| raspberry | 133 |
| longan | 132 |
| nectarine | 128 |
+-----------+------+
10 rows in set (0.00 sec)
要论SQL引擎的性能,Mysql经过这么多年的发展,其查询性能绝对领先于一堆的大数据SQL引擎。什么Spark, Hive, Drill, Impala,在非海量数据的场景下,查询性能都比不上Mysql。
Drill这里整合了Mysql的查询,当然不是没有用处。比如,它可以做多数据源的跨表联合查询。左边是Mysql表,右边是Hive表,再join起来查询,这个在某些场景下,就比较有用了。
Source URL: https://smart.postno.de/archives/3751
Copyright ©2024 司马顿的博客 unless otherwise noted.