我发现Drill查询复杂的数据结构(map, array)等,还是挺麻烦的,要做多个临时表的join。
比如这个Json:
{
"first_name":
{
"0":"Robert",
"1":"Steve",
"2":"Anne",
"3":"Alice"
},
"last_name":
{
"0":"Hernandez",
"1":"Smith",
"2":"Raps",
"3":"Muller"
},
"birthday":
{
"0":"5/3/67",
"1":"8/4/84",
"2":"9/13/91",
"3":"4/15/75"
}
}
要按照合理的表格布局来查询,得写如下一大堆SQL:
> select firstName,lastName,birthday from
> (
> select row_number() over(order by '1') as rowID,
> flatten(kvgen(first_name))['value'] as firstName
> from `columns.json`
> ) as t1 join
> (
> select row_number() over(order by '1') as rowID,
> flatten(kvgen(last_name))['value'] as lastName
> from `columns.json`
> ) as t2 on t1.rowID = t2.rowID
> join
> (
> select row_number() over(order by '1') as rowID,
> flatten(kvgen(birthday))['value'] as birthday
> from `columns.json`
> ) as t3 on t1.rowID = t3.rowID;
+-----------+-----------+----------+
| firstName | lastName | birthday |
+-----------+-----------+----------+
| Robert | Hernandez | 5/3/67 |
| Steve | Smith | 8/4/84 |
| Anne | Raps | 9/13/91 |
| Alice | Muller | 4/15/75 |
+-----------+-----------+----------+
4 rows selected (0.368 seconds)
主要在于drill读取Json默认还是压缩到一起的,要调用kvgen和flatten之类的函数,把它们由一行多列的格式,打散成多行一列的格式。
比如默认的select是这种格式:
> select first_name from `columns.json`;
+---------------------------------------------------+
| first_name |
+---------------------------------------------------+
| {"0":"Robert","1":"Steve","2":"Anne","3":"Alice"} |
+---------------------------------------------------+
1 row selected (0.141 seconds)
调用kvgen(产生map)和flatten(打散)函数后,才能得到这种格式:
> select flatten(kvgen(first_name)) from `columns.json`;
+------------------------------+
| EXPR$0 |
+------------------------------+
| {"key":"0","value":"Robert"} |
| {"key":"1","value":"Steve"} |
| {"key":"2","value":"Anne"} |
| {"key":"3","value":"Alice"} |
+------------------------------+
4 rows selected (0.14 seconds)
当然可以单独3个select得到各个列的数据:
> select row_number() over (order by '1') as rowID, flatten(kvgen(first_name))['value'] as firstName from `columns.json`;
+-------+-----------+
| rowID | firstName |
+-------+-----------+
| 1 | Robert |
| 2 | Steve |
| 3 | Anne |
| 4 | Alice |
+-------+-----------+
4 rows selected (0.176 seconds)
> select row_number() over (order by '1') as rowID, flatten(kvgen(last_name))['value'] as lastName from `columns.json`;
+-------+-----------+
| rowID | lastName |
+-------+-----------+
| 1 | Hernandez |
| 2 | Smith |
| 3 | Raps |
| 4 | Muller |
+-------+-----------+
4 rows selected (0.155 seconds)
> select row_number() over (order by '1') as rowID, flatten(kvgen(birthday))['value'] as birthday from `columns.json`;
+-------+----------+
| rowID | birthday |
+-------+----------+
| 1 | 5/3/67 |
| 2 | 8/4/84 |
| 3 | 9/13/91 |
| 4 | 4/15/75 |
+-------+----------+
4 rows selected (0.124 seconds)
但要把它们join到一起,就要按照第一个SQL代码去执行了。这样的SQL写起来很长,也比较费力。不过相比起来,drill能把Json按照SQL语句去解析,已经很强大了。这样对无处不在的Json数据的查询,就有了趁手的工具。