Besides the traditional SQL column types, both Hive and Drill support more advanced data structures such as map and array. Hive additionally supports a type called struct.
These structures are flexible and expressive. That said, anyone accustomed to relational SQL rarely reaches for them: the same needs are usually met by designing additional tables.
In Hive, structured columns must be declared up front, when the table is created. For example:
create table foods (
name string,
color array<string>,
class map<string,int>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;
The statement above creates a foods table with three columns:
- name: the food's name, a string
- color: the food's colors, an array, since a food can have more than one color
- class: the food's classification, a map of key/value pairs
The trailing clauses are essential: they specify the delimiters used when loading external data. In detail:
- FIELDS TERMINATED BY '|' // fields are separated by a vertical bar
- COLLECTION ITEMS TERMINATED BY ',' // elements within a single field (e.g. array items) are separated by commas
- MAP KEYS TERMINATED BY ':' // each map key is separated from its value by a colon
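As a sanity check on this delimiter scheme, here is a minimal Python sketch (not anything Hive uses internally) that splits one such row with the same three separators:

```python
def parse_row(line):
    """Split a foods row using the declared Hive delimiters:
    '|' between fields, ',' between collection items, ':' between map keys and values."""
    name, color, cls = line.split("|")
    return {
        "name": name,
        "color": color.split(","),          # array<string>
        "class": {k: int(v)                 # map<string,int>
                  for k, v in (pair.split(":") for pair in cls.split(","))},
    }

row = parse_row("orange|red,yellow|cat:1,grp:1")
# row["color"] -> ["red", "yellow"]; row["class"] -> {"cat": 1, "grp": 1}
```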
Here is what the raw data looks like:
orange|red,yellow|cat:1,grp:1
apple|red,brown|cat:1,grp:2
cherry|red,purple|cat:1,grp:3
plum|yellow,black|cat:1,grp:4
rice|white,black|cat:2,grp:1
toast|yellow,brown|cat:2,grp:2
bread|yellow,white|cat:2,grp:3
It is a plain text file, one record per line. Fields are separated by vertical bars and array elements by commas; within the map, each key is separated from its value by a colon, and the key/value pairs themselves are also separated by commas.
Upload this file to HDFS, then load it into the table from within Hive:
> LOAD DATA INPATH '/tmp/test/foods.txt' OVERWRITE INTO TABLE foods;
Now run some basic SQL queries. Filtering on an array element:
> select * from foods where color[0] = 'yellow';
+-------------+---------------------+--------------------+
| foods.name | foods.color | foods.class |
+-------------+---------------------+--------------------+
| plum | ["yellow","black"] | {"cat":1,"grp":4} |
| toast | ["yellow","brown"] | {"cat":2,"grp":2} |
| bread | ["yellow","white"] | {"cat":2,"grp":3} |
+-------------+---------------------+--------------------+
3 rows selected (2.845 seconds)
Filtering on a map entry:
> select * from foods where class['cat'] = 1;
+-------------+---------------------+--------------------+
| foods.name | foods.color | foods.class |
+-------------+---------------------+--------------------+
| orange | ["red","yellow"] | {"cat":1,"grp":1} |
| apple | ["red","brown"] | {"cat":1,"grp":2} |
| cherry | ["red","purple"] | {"cat":1,"grp":3} |
| plum | ["yellow","black"] | {"cat":1,"grp":4} |
+-------------+---------------------+--------------------+
4 rows selected (0.204 seconds)
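The two predicates translate directly into subscripting in an ordinary programming language. A small Python sketch of the same access pattern, over a hand-parsed subset of the sample rows (array by index, map by key):

```python
# A subset of the sample rows, in the parsed form Hive presents them:
# color is an array, class is a map.
rows = [
    {"name": "orange", "color": ["red", "yellow"],  "class": {"cat": 1, "grp": 1}},
    {"name": "apple",  "color": ["red", "brown"],   "class": {"cat": 1, "grp": 2}},
    {"name": "rice",   "color": ["white", "black"], "class": {"cat": 2, "grp": 1}},
]

# where color[0] = 'yellow'  ->  r["color"][0] == "yellow"
# where class['cat'] = 1     ->  r["class"]["cat"] == 1
cat1 = [r["name"] for r in rows if r["class"]["cat"] == 1]
# cat1 -> ["orange", "apple"]
```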
Arrays and maps are accessed much as they are in the programming languages you already know. In practice, these Hive structures are both convenient and flexible.
In Drill, these structures are derived from JSON; that is, the external file must be in JSON format. For example:
{ "name": "orange",
"color": [ "red", "yellow" ],
"class": { "cat":1,
"grp":1
}
}
{ "name": "cherry",
"color": [ "purple", "red" ],
"class": { "cat":1,
"grp":2
}
}
{ "name": "plum",
"color": [ "black", "red" ],
"class": { "cat":1,
"grp":3
}
}
Query this JSON file directly in Drill and the corresponding structures are produced:
apache drill (dfs.pyh)> select * from `foods.json`;
+--------+------------------+-------------------+
| name | color | class |
+--------+------------------+-------------------+
| orange | ["red","yellow"] | {"cat":1,"grp":1} |
| cherry | ["purple","red"] | {"cat":1,"grp":2} |
| plum | ["black","red"] | {"cat":1,"grp":3} |
+--------+------------------+-------------------+
3 rows selected (0.173 seconds)
Here color is an array and class is a map; they are accessed in much the same way as in Hive.
> select name,color[0] as priColor, class['grp'] as priGroup from `foods.json`;
+--------+----------+----------+
| name | priColor | priGroup |
+--------+----------+----------+
| orange | red | 1 |
| cherry | purple | 2 |
| plum | black | 3 |
+--------+----------+----------+
3 rows selected (0.141 seconds)
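The same projection is easy to mirror with Python's standard json module. A sketch, assuming (as in the sample file) the input is a stream of concatenated JSON objects, which `json.JSONDecoder.raw_decode` can walk one object at a time:

```python
import json

# Two of the sample records, concatenated as in the source file.
raw = '''{ "name": "orange", "color": ["red","yellow"], "class": {"cat":1,"grp":1} }
{ "name": "cherry", "color": ["purple","red"], "class": {"cat":1,"grp":2} }'''

# Decode the stream object by object, skipping whitespace between records.
dec = json.JSONDecoder()
records, idx = [], 0
while idx < len(raw):
    obj, idx = dec.raw_decode(raw, idx)
    records.append(obj)
    while idx < len(raw) and raw[idx].isspace():
        idx += 1

# select name, color[0], class['grp']
proj = [(r["name"], r["color"][0], r["class"]["grp"]) for r in records]
# proj -> [("orange", "red", 1), ("cherry", "purple", 2)]
```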
In terms of convenience, when these advanced structures are needed, Drill constrains the external file to be JSON, while Hive constrains it through the format declared at table creation. Drill has the edge here, simply because the real world is full of JSON files.