Drill's MySQL Storage Plugin

The previous post covered configuring the Hive 3 storage plugin in Drill 1.20, which was a bit of a hassle.

With that done, I also set up the MySQL storage plugin, which turned out to be much simpler.

Put the MySQL connector JAR on Drill's classpath, start drill-embedded, open the web UI, and in the default mysql plugin configuration change the following values:

  "url": "jdbc:mysql://localhost:3306",
  "username": "spark",
  "password": "spark123",

These are simply the connection details and credentials for MySQL. Once changed, enable the plugin in the web UI and you are done.
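For reference, the whole plugin definition is a small JSON document. Below is a minimal sketch, assuming the MySQL 8 Connector/J driver class; the driver name and credentials here are illustrative, not the only valid values. The connector JAR itself typically goes into the jars/3rdparty directory under the Drill install so it lands on the classpath.

  {
    "type": "jdbc",
    "driver": "com.mysql.cj.jdbc.Driver",
    "url": "jdbc:mysql://localhost:3306",
    "username": "spark",
    "password": "spark123",
    "enabled": true
  }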

Back in the Drill shell, we can browse MySQL like this:

apache drill (mysql.spark)> use mysql.spark;
+------+-----------------------------------------+
|  ok  |                 summary                 |
+------+-----------------------------------------+
| true | Default schema changed to [mysql.spark] |
+------+-----------------------------------------+
1 row selected (0.1 seconds)

apache drill (mysql.spark)> show tables;
+--------------+---------------------------+
| TABLE_SCHEMA |        TABLE_NAME         |
+--------------+---------------------------+
| mysql.spark  | AUX_TABLE                 |
| mysql.spark  | BUCKETING_COLS            |

Here is a simple select:

apache drill (mysql.spark)> select name,sum(number) as dd from fruit group by name order by dd desc limit 10;
+-----------+-----+
|   name    | dd  |
+-----------+-----+
| blueberry | 166 |
| cumquat   | 145 |
| lemon     | 145 |
| areca nut | 139 |
| haw       | 137 |
| greengage | 134 |
| banana    | 134 |
| raspberry | 133 |
| longan    | 132 |
| nectarine | 128 |
+-----------+-----+
10 rows selected (0.155 seconds)

It does feel a bit odd to run SQL against MySQL from inside Drill. MySQL is, after all, the established player in this space, and its own SQL execution is very fast. For the same query, MySQL consistently reports 0.00 seconds:

mysql> select name,sum(number) as dd from fruit group by name order by dd desc limit 10;
+-----------+------+
| name      | dd   |
+-----------+------+
| blueberry |  166 |
| cumquat   |  145 |
| lemon     |  145 |
| areca nut |  139 |
| haw       |  137 |
| banana    |  134 |
| greengage |  134 |
| raspberry |  133 |
| longan    |  132 |
| nectarine |  128 |
+-----------+------+
10 rows in set (0.00 sec)

When it comes to raw query performance, MySQL, after so many years of development, is well ahead of the crowd of big-data SQL engines. On data that is not truly massive, Spark, Hive, Drill, and Impala all lose to MySQL on query latency.

That said, Drill's MySQL integration is certainly not useless. For example, it enables federated queries across multiple data sources: a MySQL table on one side, a Hive table on the other, joined together in a single query, as sketched below. In some scenarios that is genuinely handy.
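As a rough sketch, such a cross-source join could look like the following. The Hive table fruit_stock and its columns are hypothetical; mysql.spark and hive.`default` are the plugin.schema prefixes assumed from the session above.

select m.name, h.warehouse, sum(m.number) as total
from mysql.spark.fruit m
join hive.`default`.fruit_stock h on m.name = h.name
group by m.name, h.warehouse
order by total desc;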
