by 司马顿 | April 1, 2022, 7:01 PM
Apache Drill makes loading data very convenient: no schema has to be created before loading, because it follows the schema-on-read principle.
But once the data is loaded, you still need a schema to do statistical analysis on it. In Drill, though, the only option is to convert to a specific schema by hand. In this respect it is even less convenient than Spark, because Spark can infer the schema.
Let's look at the difference in the data before and after the schema conversion. The original data is this file, a CSV dataset.
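The contents of the file are not reproduced here, but before creating any views you can peek at the raw rows directly from Drill. This is only a quick sketch, not part of the original post; it assumes the csv format in the dfs storage plugin extracts the header row, which the named columns used in the views below already imply.
-- Sketch only: inspect a few raw rows before defining any views
select name, sex, born, salary from `people.csv` limit 3;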
Create temporary table 1, without any schema conversion:
apache drill (dfs.pyh)> create view tt1 as
2............semicolon> select name,sex,born,salary from `people.csv`;
+------+-----------------------------------------------------+
| ok | summary |
+------+-----------------------------------------------------+
| true | View 'tt1' created successfully in 'dfs.pyh' schema |
+------+-----------------------------------------------------+
Create temporary table 2, this time with a schema conversion (using the cast function):
apache drill (dfs.pyh)> create view tt2 as
2............semicolon> select cast(name as varchar) as name,
3............semicolon> cast(sex as varchar) as sex,
4............semicolon> cast(born as date) as born,
5............semicolon> cast(salary as float) as salary
6............semicolon> from `people.csv`;
+------+-----------------------------------------------------+
| ok | summary |
+------+-----------------------------------------------------+
| true | View 'tt2' created successfully in 'dfs.pyh' schema |
+------+-----------------------------------------------------+
Check the structure of temporary table 1:
apache drill (dfs.pyh)> desc tt1;
+-------------+-----------+-------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+-------------+-----------+-------------+
| name | ANY | YES |
| sex | ANY | YES |
| born | ANY | YES |
| salary | ANY | YES |
+-------------+-----------+-------------+
Then check the structure of temporary table 2:
apache drill (dfs.pyh)> desc tt2;
+-------------+-------------------+-------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+-------------+-------------------+-------------+
| name | CHARACTER VARYING | YES |
| sex | CHARACTER VARYING | YES |
| born | DATE | YES |
| salary | FLOAT | YES |
+-------------+-------------------+-------------+
I suspect anyone familiar with Java will get a headache seeing the data types of temporary table 1. ANY means the type information has been lost, and nothing useful can be done with it.
Temporary table 2, on the other hand, has the correct types, and it is these types that make the various aggregate operations possible. For example, this simple average:
apache drill (dfs.pyh)> select avg(salary) as avg_salary from tt2;
+------------+
| avg_salary |
+------------+
| 17583.093 |
+------------+
1 row selected (0.249 seconds)
This query cannot be run against temporary table 1, because the type of its salary column has been lost.
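If you do need to aggregate over tt1, one workaround, sketched here and not taken from the original post, is to repeat the cast inside the query itself; the downside is that the cast has to be written in every statement:
-- Sketch: cast inline to work around the lost type on tt1
select avg(cast(salary as float)) as avg_salary from tt1;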
Here Drill is even less convenient than Spark, because when loading data Spark accepts an inferSchema option that guesses the types. For example:
scala> val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("tmp/people.csv")
df: org.apache.spark.sql.DataFrame = [NAME: string, SEX: string ... 5 more fields]
scala> df.printSchema
root
|-- NAME: string (nullable = true)
|-- SEX: string (nullable = true)
|-- BORN: string (nullable = true)
|-- ZIP: integer (nullable = true)
|-- EMAIL: string (nullable = true)
|-- JOB: string (nullable = true)
|-- SALARY: double (nullable = true)
As shown above, Spark converted the types automatically while reading: for example, ZIP is an integer and SALARY is a double.
With the types in place, aggregations are easy. The same average query:
scala> df.agg(avg($"salary")).show
+-----------+
|avg(salary)|
+-----------+
| 17583.093|
+-----------+
My feeling is that in terms of type handling, Drill is not as convenient as Spark. Although Drill has a more powerful SQL engine, this weakness in type adaptation lowers its usability.
Source URL: https://smart.postno.de/archives/3734
Copyright ©2024 司马顿的博客 unless otherwise noted.