About Drill's Schema

Apache Drill makes loading data very convenient: no schema has to be created before loading, since it follows the schema-on-read principle.
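
For example, here is a minimal sketch of querying the raw file directly, with no DDL beforehand (assuming people.csv sits in an already-configured workspace such as the dfs.pyh workspace used below, and that the CSV format has extractHeader enabled so the header row supplies the column names):

select name, sex, born, salary from `people.csv` limit 3;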

But once the data is in, a schema is still needed before you can do any statistical analysis on it. In Drill, however, the data can only be converted to a concrete schema by hand. In this respect it is even less convenient than Spark, which can infer a schema automatically.

Let's look at how the data differs before and after the schema conversion. The raw data is a CSV file, people.csv.

Create the first view, tt1, without any schema conversion:

apache drill (dfs.pyh)> create view tt1 as 
2............semicolon> select name,sex,born,salary from `people.csv`;
+------+-----------------------------------------------------+
|  ok  |                       summary                       |
+------+-----------------------------------------------------+
| true | View 'tt1' created successfully in 'dfs.pyh' schema |
+------+-----------------------------------------------------+

Create the second view, tt2, with a schema conversion done via the cast function:

apache drill (dfs.pyh)> create view tt2 as
2............semicolon> select cast(name as varchar) as name,
3............semicolon> cast(sex as varchar) as sex,
4............semicolon> cast(born as date) as born,
5............semicolon> cast(salary as float) as salary
6............semicolon> from `people.csv`;
+------+-----------------------------------------------------+
|  ok  |                       summary                       |
+------+-----------------------------------------------------+
| true | View 'tt2' created successfully in 'dfs.pyh' schema |
+------+-----------------------------------------------------+

Describe the structure of the first view:

apache drill (dfs.pyh)> desc tt1;
+-------------+-----------+-------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+-------------+-----------+-------------+
| name        | ANY       | YES         |
| sex         | ANY       | YES         |
| born        | ANY       | YES         |
| salary      | ANY       | YES         |
+-------------+-----------+-------------+

Now describe the structure of the second view:

apache drill (dfs.pyh)> desc tt2;
+-------------+-------------------+-------------+
| COLUMN_NAME |     DATA_TYPE     | IS_NULLABLE |
+-------------+-------------------+-------------+
| name        | CHARACTER VARYING | YES         |
| sex         | CHARACTER VARYING | YES         |
| born        | DATE              | YES         |
| salary      | FLOAT             | YES         |
+-------------+-------------------+-------------+

I suspect anyone used to Java will get a headache looking at the data types of tt1. ANY means the type information has been lost, and with that you can do nothing useful.

View tt2, on the other hand, has the correct types, and it is only on top of these types that the various aggregations can run. For example, a simple average:

apache drill (dfs.pyh)> select avg(salary) as avg_salary from tt2;
+------------+
| avg_salary |
+------------+
| 17583.093  |
+------------+
1 row selected (0.249 seconds)

This query cannot run against tt1, because the type of its salary column has been lost.
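
One workaround, sketched here rather than taken from the session above, is to put the cast inside the aggregate itself; the tt2 view simply performs that conversion once so that every query does not have to repeat it:

select avg(cast(salary as float)) as avg_salary from tt1;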

This is where Drill is even less convenient than Spark: when Spark loads data, it can be given an inferSchema option that guesses the types. For example:

scala> val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("tmp/people.csv")
df: org.apache.spark.sql.DataFrame = [NAME: string, SEX: string ... 5 more fields]

scala> df.printSchema
root
 |-- NAME: string (nullable = true)
 |-- SEX: string (nullable = true)
 |-- BORN: string (nullable = true)
 |-- ZIP: integer (nullable = true)
 |-- EMAIL: string (nullable = true)
 |-- JOB: string (nullable = true)
 |-- SALARY: double (nullable = true)

As shown above, Spark converts the types automatically while reading; for example, ZIP becomes an integer and SALARY a double.

With types in place, aggregations are easy. The same average computation:

scala> df.agg(avg($"salary")).show
+-----------+
|avg(salary)|
+-----------+
|  17583.093|
+-----------+

My feeling is that Drill is less convenient than Spark when it comes to type handling. Drill has the more powerful SQL engine, but this weakness in type adaptation lowers its usability.
