Advanced Data Structures in Hive and Drill

Besides the traditional SQL column types, both Hive and Drill support additional, more advanced data structures such as map and array. Hive also supports a structure called struct.

These structures are flexible and more expressive. That said, anyone used to relational SQL rarely reaches for them: the same needs are usually met by designing extra tables.
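
Struct will not appear in the walkthrough below, so here is a minimal sketch of what a struct column looks like; the table and field names are purely illustrative:

-- A struct groups a fixed set of named, typed fields under one column.
create table food_info (
      name string,
      nutrition struct<calories:int,sugar:double>
);

-- Struct fields are read with dot notation:
-- select name, nutrition.calories from food_info;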

In Hive, these structured columns have to be declared up front, when the table is created. For example:

create table foods (
      name string,
      color array<string>,
      class map<string,int>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

The statement above creates a foods table with three columns:

  • name: the food's name, a string
  • color: the food's colors, an array, since a food can have more than one color
  • class: the food's classification, a map of key/value pairs

The trailing clauses are essential: they specify the delimiters used when external data is loaded. In detail:

  • FIELDS TERMINATED BY '|' // columns are separated by a vertical bar
  • COLLECTION ITEMS TERMINATED BY ',' // elements within a single column, such as array entries, are separated by commas
  • MAP KEYS TERMINATED BY ':' // a map key and its value are separated by a colon

Here is what the raw data looks like:

orange|red,yellow|cat:1,grp:1
apple|red,brown|cat:1,grp:2
cherry|red,purple|cat:1,grp:3
plum|yellow,black|cat:1,grp:4
rice|white,black|cat:2,grp:1
toast|yellow,brown|cat:2,grp:2
bread|yellow,white|cat:2,grp:3

It is a plain text file with one record per line. Columns are separated by vertical bars, array elements by commas, and inside the map a key and its value are separated by a colon, with commas between the key/value pairs.

Upload the file to HDFS, then load it into the table from Hive:

> LOAD DATA INPATH '/tmp/test/foods.txt' OVERWRITE INTO TABLE foods;
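
If the file has not been copied to HDFS yet, Hive can also read it straight off the client machine; a minimal sketch, with a purely illustrative local path:

-- LOCAL copies the file from the client filesystem instead of
-- moving it within HDFS
> LOAD DATA LOCAL INPATH '/tmp/foods.txt' OVERWRITE INTO TABLE foods;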

Now run some basic SQL queries. Querying an array element:

> select * from foods where color[0] = 'yellow';
+-------------+---------------------+--------------------+
| foods.name  |     foods.color     |    foods.class     |
+-------------+---------------------+--------------------+
| plum        | ["yellow","black"]  | {"cat":1,"grp":4}  |
| toast       | ["yellow","brown"]  | {"cat":2,"grp":2}  |
| bread       | ["yellow","white"]  | {"cat":2,"grp":3}  |
+-------------+---------------------+--------------------+
3 rows selected (2.845 seconds)
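
Indexing by position is not the only option; Hive also ships built-in functions for arrays. A quick sketch, not part of the original session:

-- array_contains() matches anywhere in the array, not just position 0
> select name from foods where array_contains(color, 'yellow');

-- explode() flattens the array into one row per element
> select name, c from foods lateral view explode(color) t as c;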

Querying a map element:

> select * from foods where class['cat'] = 1;
+-------------+---------------------+--------------------+
| foods.name  |     foods.color     |    foods.class     |
+-------------+---------------------+--------------------+
| orange      | ["red","yellow"]    | {"cat":1,"grp":1}  |
| apple       | ["red","brown"]     | {"cat":1,"grp":2}  |
| cherry      | ["red","purple"]    | {"cat":1,"grp":3}  |
| plum        | ["yellow","black"]  | {"cat":1,"grp":4}  |
+-------------+---------------------+--------------------+
4 rows selected (0.204 seconds)
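
Maps have comparable built-ins; again an illustrative sketch rather than part of the original session:

-- map_keys()/map_values() return the keys and values as arrays,
-- and size() counts the key/value pairs
> select name, map_keys(class), size(class) from foods;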

Arrays and maps are accessed much the way they are in most programming languages, and in practice these advanced structures make Hive both convenient and flexible.

In Drill, these data structures come from JSON: the external file has to be in JSON format. For example, take this JSON:

{ "name": "orange",
  "color": [ "red", "yellow" ],
  "class": { "cat":1,
             "grp":1
           }
}
{ "name": "cherry",
  "color": [ "purple", "red" ],
  "class": { "cat":1,
             "grp":2
           }
}
{ "name": "plum",
  "color": [ "black", "red" ],
  "class": { "cat":1,
             "grp":3
           }
}

Querying the JSON file directly in Drill yields the corresponding data structures:

apache drill (dfs.pyh)> select * from `foods.json`;
+--------+------------------+-------------------+
|  name  |      color       |       class       |
+--------+------------------+-------------------+
| orange | ["red","yellow"] | {"cat":1,"grp":1} |
| cherry | ["purple","red"] | {"cat":1,"grp":2} |
| plum   | ["black","red"]  | {"cat":1,"grp":3} |
+--------+------------------+-------------------+
3 rows selected (0.173 seconds)

Here, color is an array and class is a map, and they are accessed much as in Hive:

> select name,color[0] as priColor, class['grp'] as priGroup from `foods.json`;
+--------+----------+----------+
|  name  | priColor | priGroup |
+--------+----------+----------+
| orange | red      | 1        |
| cherry | purple   | 2        |
| plum   | black    | 3        |
+--------+----------+----------+
3 rows selected (0.141 seconds)
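
Drill also provides FLATTEN() for unnesting arrays; a minimal sketch, not taken from the original session:

-- FLATTEN() expands the array into one row per element
> select name, flatten(color) as color from `foods.json`;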

In terms of convenience, Drill only requires the external file to be JSON, while Hive constrains the external file through the format declared when the table is created. Drill is arguably the handier of the two, if only because the real world is full of JSON files.
