Apache Drill makes loading data very convenient: you don't need to create a schema beforehand, because it follows the schema-on-read principle.
But once the data is loaded, you still need a schema in order to do any statistical analysis on it, and Drill can only get one by converting the columns to specific types by hand. In this respect it is not even as convenient as Spark, which can infer a schema.
Let's look at how the data differs before and after the schema conversion. The raw data is a CSV file, people.csv.
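Judging from the columns used below, its layout is roughly the following (the column names come from the file's header; the second line is a made-up sample row, purely for illustration):

NAME,SEX,BORN,ZIP,EMAIL,JOB,SALARY
Alice,F,1985-03-12,94105,alice@example.com,Engineer,18000.0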
Create the first view, tt1, without any schema conversion:
apache drill (dfs.pyh)> create view tt1 as
2............semicolon> select name,sex,born,salary from `people.csv`;
+------+-----------------------------------------------------+
|  ok  |                       summary                       |
+------+-----------------------------------------------------+
| true | View 'tt1' created successfully in 'dfs.pyh' schema |
+------+-----------------------------------------------------+
Create the second view, tt2, this time converting the schema with the CAST function:
apache drill (dfs.pyh)> create view tt2 as
2............semicolon> select cast(name as varchar) as name,
3............semicolon> cast(sex as varchar) as sex,
4............semicolon> cast(born as date) as born,
5............semicolon> cast(salary as float) as salary
6............semicolon> from `people.csv`;
+------+-----------------------------------------------------+
|  ok  |                       summary                       |
+------+-----------------------------------------------------+
| true | View 'tt2' created successfully in 'dfs.pyh' schema |
+------+-----------------------------------------------------+
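A side note on cast(born as date): the cast only works if the dates in the file are already in a format CAST can parse directly (typically ISO style, yyyy-MM-dd). If the column used some other pattern, Drill's TO_DATE function with a format string would be the usual way out. A sketch, with the format string purely as an example:

apache drill (dfs.pyh)> select to_date(born, 'dd/MM/yyyy') as born from `people.csv` limit 3;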
Check the structure of tt1:
apache drill (dfs.pyh)> desc tt1;
+-------------+-----------+-------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+-------------+-----------+-------------+
| name        | ANY       | YES         |
| sex         | ANY       | YES         |
| born        | ANY       | YES         |
| salary      | ANY       | YES         |
+-------------+-----------+-------------+
Now check the structure of tt2:
apache drill (dfs.pyh)> desc tt2;
+-------------+-------------------+-------------+
| COLUMN_NAME |     DATA_TYPE     | IS_NULLABLE |
+-------------+-------------------+-------------+
| name        | CHARACTER VARYING | YES         |
| sex         | CHARACTER VARYING | YES         |
| born        | DATE              | YES         |
| salary      | FLOAT             | YES         |
+-------------+-------------------+-------------+
I suspect anyone used to Java will get a headache looking at tt1's data types. ANY means the type information has been lost, and with that you can do essentially nothing.
tt2, by contrast, has the correct types, and it is these types that make aggregate operations possible. For example, a simple average:
apache drill (dfs.pyh)> select avg(salary) as avg_salary from tt2;
+------------+
| avg_salary |
+------------+
| 17583.093  |
+------------+
1 row selected (0.249 seconds)
The same query cannot run against tt1, because the type of its salary column has been lost.
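Strictly speaking you could still get the number out of tt1 by casting inline in the query itself; this is just a sketch, not verified against the data, and it is exactly the work that the tt2 view does once, up front:

apache drill (dfs.pyh)> select avg(cast(salary as float)) as avg_salary from tt1;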
In this respect Drill is less convenient than Spark: when Spark loads data, you can set an inferSchema option and it will guess the types for you. For example:
scala> val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("tmp/people.csv")
df: org.apache.spark.sql.DataFrame = [NAME: string, SEX: string ... 5 more fields]
scala> df.printSchema
root
|-- NAME: string (nullable = true)
|-- SEX: string (nullable = true)
|-- BORN: string (nullable = true)
|-- ZIP: integer (nullable = true)
|-- EMAIL: string (nullable = true)
|-- JOB: string (nullable = true)
|-- SALARY: double (nullable = true)
As shown above, Spark converted the types automatically while reading: zip came out as an integer and salary as a double.
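Inference is not the only option on the Spark side, either: if you already know the types, you can pass the reader an explicit schema and skip the inference pass. A minimal sketch, with the column names taken from the header above and the types chosen to match what inference produced:

import org.apache.spark.sql.types._

// Declare the column types up front instead of letting Spark infer them.
val schema = StructType(Seq(
  StructField("NAME", StringType),
  StructField("SEX", StringType),
  StructField("BORN", StringType),
  StructField("ZIP", IntegerType),
  StructField("EMAIL", StringType),
  StructField("JOB", StringType),
  StructField("SALARY", DoubleType)
))

val df2 = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)          // no inference pass over the data
  .load("tmp/people.csv")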
With types in place, aggregation is straightforward. The same average calculation:
scala> df.agg(avg($"salary")).show
+-----------+
|avg(salary)|
+-----------+
|  17583.093|
+-----------+
My feeling is that Drill is less convenient than Spark when it comes to type handling. Drill has the more powerful SQL engine, but this weakness in type adaptation reduces its usability.