使用默认mr引擎的Hive实在太慢

by 司马顿 | 2022年3月29日 下午4:50

在过去Hadoop时代,Hive几乎是默认的Hadoop数据的查询引擎。当然Hive主要做离线查询使用,它是针对数据规模设计的,对交互式查询的能力没考虑那么多。

如果使用Hadoop默认的mr引擎,Hive实在太慢了。我这里用Hive和另一款类似的SQL引擎,Apache Drill做一个对比。

测试数据来自这里,是一个csv文件,包含10000条个人数据。共有5栏,分别是姓名、生日、性别、电话、工作。这是一份程序生成的fake数据。

Hive加载csv复杂一些,先要将数据导入到HDFS,然后再创建table,再load数据到table里。

将数据上传到本机部署的HDFS:

$ hdfs dfs -put people.csv /tmp/test/

登陆Hive shell,在Hive里创建一个table:

create table ppl (
  name string,
  born date,
  sex string,
  tel int,
  job string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

上述创建了名为ppl的table,包含了5列:name, born, sex, tel, job。同时指定了列的类型。

因为csv是用逗号分隔的,所以指定一个属性:FIELDS TERMINATED BY ‘,’

接下来,使用如下语句,从HDFS加载csv文件到table:

LOAD DATA INPATH '/tmp/test/people.csv' OVERWRITE INTO TABLE ppl;

table数据有了,就可以运行sql查询了。我在beeline里使用的测试sql语句如下:

0: jdbc:hive2://localhost:10000/default> select job,count(*) as dd from ppl group by job order by dd desc;
+---------------------------------------+------+
|                  job                  |  dd  |
+---------------------------------------+------+
| Market Research Analyst               | 122  |
| Patrol Officer                        | 122  |
| Administrative Assistant              | 120  |
| Pharmacy Technician                   | 119  |
| Paramedic                             | 117  |
| Personal Care Aide                    | 116  |
| Sales Manager                         | 115  |
| Restaurant Cook                       | 114  |
| Financial Manager                     | 113  |
| Mechanical Engineer                   | 113  |
| Clinical Social Worker                | 112  |
| Loan Officer                          | 112  |
| Substance Abuse Counselor             | 112  |
| Speech-Language Pathologist           | 112  |
| High School Teacher                   | 112  |
| Epidemiologist                        | 111  |
| Cost Estimator                        | 111  |
| Financial Analyst                     | 111  |
| Medical Secretary                     | 110  |
| Dentist                               | 110  |
| Clinical Laboratory Technician        | 110  |
| Recreation & Fitness Worker           | 109  |
| Insurance Agent                       | 109  |
| Medical Assistant                     | 109  |
| Public Relations Specialist           | 108  |
| Marriage & Family Therapist           | 108  |
| Database Administrator                | 107  |
| Architect                             | 107  |
| Pharmacist                            | 107  |
| Glazier                               | 106  |
| Diagnostic Medical Sonographer        | 105  |
| Compliance Officer                    | 105  |
| Computer Systems Analyst              | 105  |
| Auto Mechanic                         | 105  |
| Telemarketer                          | 104  |
| Hairdresser                           | 104  |
| Logistician                           | 104  |
| Physical Therapist                    | 103  |
| Meeting Convention & Event Planner    | 103  |
| Carpenter                             | 103  |
| Sports Coach                          | 102  |
| Security Guard                        | 102  |
| Bus Driver                            | 102  |
| Dental Assistant                      | 102  |
| Sales Representative                  | 102  |
| Dental Hygienist                      | 101  |
| School Psychologist                   | 101  |
| Preschool Teacher                     | 101  |
| Paralegal                             | 101  |
| Elementary School Teacher             | 101  |
| Maintenance & Repair Worker           | 100  |
| Fabricator                            | 100  |
| Veterinarian                          | 100  |
| Painter                               | 100  |
| Landscaper & Groundskeeper            | 100  |
| Marketing Manager                     | 99   |
| Middle School Teacher                 | 99   |
| School Counselor                      | 99   |
| Home Health Aide                      | 99   |
| Executive Assistant                   | 99   |
| Construction Worker                   | 99   |
| Software Developer                    | 98   |
| Brickmason & Blockmason               | 98   |
| Physical Therapist Assistant          | 98   |
| Art Director                          | 97   |
| Construction Manager                  | 97   |
| Customer Service Representative       | 97   |
| Respiratory Therapist                 | 96   |
| Veterinary Technologist & Technician  | 96   |
| Massage Therapist                     | 96   |
| Radiologic Technologist               | 95   |
| Registered Nurse                      | 95   |
| Civil Engineer                        | 94   |
| Accountant                            | 93   |
| Real Estate Agent                     | 93   |
| Janitor                               | 92   |
| Financial Advisor                     | 92   |
| Occupational Therapist                | 92   |
| Delivery Truck Driver                 | 92   |
| Electrician                           | 91   |
| Child & Family Social Worker          | 90   |
| Management Analyst                    | 90   |
| Nursing Aide                          | 90   |
| Teacher Assistant                     | 89   |
| Business Operations Manager           | 89   |
| Web Developer                         | 87   |
| Cement Mason & Concrete Finisher      | 86   |
| Mental Health Counselor               | 86   |
| Maid & Housekeeper                    | 85   |
| Physician                             | 84   |
| Cashier                               | 84   |
| HR Specialist                         | 83   |
| IT Manager                            | 83   |
| Interpreter & Translator              | 82   |
| Lawyer                                | 82   |
| Computer Systems Administrator        | 81   |
| Computer Programmer                   | 81   |
| Receptionist                          | 79   |
| Plumber                               | 77   |
| Bookkeeping Accounting & Audit Clerk  | 76   |
+---------------------------------------+------+
100 rows selected (2.66 seconds)

如上,可以看到清晰的查询结果,以及查询耗时:2.66秒。

再把数据导入到drill进行测试。这个十分简单,因为drill不需要定义schema,也就不需要创建表的过程。并且drill可以直接加载本地文件,无需依赖HDFS。

运行drill-embedded打开本机的一个drill实例,运行如下查询:

apache drill> select columns[4] as job, count(*) as dd from dfs.`/home/xxx/tmp/people.csv` group by job order by dd desc;
+--------------------------------------+-----+
|                 job                  | dd  |
+--------------------------------------+-----+
| Market Research Analyst              | 122 |
| Patrol Officer                       | 122 |
| Administrative Assistant             | 120 |
| Pharmacy Technician                  | 119 |
| Paramedic                            | 117 |
| Personal Care Aide                   | 116 |
| Sales Manager                        | 115 |
| Restaurant Cook                      | 114 |
| Mechanical Engineer                  | 113 |
| Financial Manager                    | 113 |
| Substance Abuse Counselor            | 112 |
| High School Teacher                  | 112 |
| Loan Officer                         | 112 |
| Clinical Social Worker               | 112 |
| Speech-Language Pathologist          | 112 |
| Cost Estimator                       | 111 |
| Financial Analyst                    | 111 |
| Epidemiologist                       | 111 |
| Clinical Laboratory Technician       | 110 |
| Medical Secretary                    | 110 |
| Dentist                              | 110 |
| Medical Assistant                    | 109 |
| Recreation & Fitness Worker          | 109 |
| Insurance Agent                      | 109 |
| Public Relations Specialist          | 108 |
| Marriage & Family Therapist          | 108 |
| Architect                            | 107 |
| Pharmacist                           | 107 |
| Database Administrator               | 107 |
| Glazier                              | 106 |
| Auto Mechanic                        | 105 |
| Computer Systems Analyst             | 105 |
| Compliance Officer                   | 105 |
| Diagnostic Medical Sonographer       | 105 |
| Hairdresser                          | 104 |
| Telemarketer                         | 104 |
| Logistician                          | 104 |
| Meeting Convention & Event Planner   | 103 |
| Physical Therapist                   | 103 |
| Carpenter                            | 103 |
| Bus Driver                           | 102 |
| Sales Representative                 | 102 |
| Dental Assistant                     | 102 |
| Security Guard                       | 102 |
| Sports Coach                         | 102 |
| Elementary School Teacher            | 101 |
| Dental Hygienist                     | 101 |
| Preschool Teacher                    | 101 |
| Paralegal                            | 101 |
| School Psychologist                  | 101 |
| Landscaper & Groundskeeper           | 100 |
| Painter                              | 100 |
| Maintenance & Repair Worker          | 100 |
| Fabricator                           | 100 |
| Veterinarian                         | 100 |
| Middle School Teacher                | 99  |
| School Counselor                     | 99  |
| Marketing Manager                    | 99  |
| Construction Worker                  | 99  |
| Home Health Aide                     | 99  |
| Executive Assistant                  | 99  |
| Physical Therapist Assistant         | 98  |
| Brickmason & Blockmason              | 98  |
| Software Developer                   | 98  |
| Art Director                         | 97  |
| Customer Service Representative      | 97  |
| Construction Manager                 | 97  |
| Respiratory Therapist                | 96  |
| Massage Therapist                    | 96  |
| Veterinary Technologist & Technician | 96  |
| Registered Nurse                     | 95  |
| Radiologic Technologist              | 95  |
| Civil Engineer                       | 94  |
| Accountant                           | 93  |
| Real Estate Agent                    | 93  |
| Financial Advisor                    | 92  |
| Janitor                              | 92  |
| Delivery Truck Driver                | 92  |
| Occupational Therapist               | 92  |
| Electrician                          | 91  |
| Child & Family Social Worker         | 90  |
| Nursing Aide                         | 90  |
| Management Analyst                   | 90  |
| Business Operations Manager          | 89  |
| Teacher Assistant                    | 89  |
| Web Developer                        | 87  |
| Mental Health Counselor              | 86  |
| Cement Mason & Concrete Finisher     | 86  |
| Maid & Housekeeper                   | 85  |
| Physician                            | 84  |
| Cashier                              | 84  |
| HR Specialist                        | 83  |
| IT Manager                           | 83  |
| Interpreter & Translator             | 82  |
| Lawyer                               | 82  |
| Computer Systems Administrator       | 81  |
| Computer Programmer                  | 81  |
| Receptionist                         | 79  |
| Plumber                              | 77  |
| Bookkeeping Accounting & Audit Clerk | 76  |
+--------------------------------------+-----+
100 rows selected (0.204 seconds)

我们也可以看到清晰的查询结果,以及查询耗时:0.204秒。

这个简单对比,说明Hive实在太慢了。是Apache drill的十倍之慢。

与此同时,再对比下spark的查询性能。打开spark-shell,首先加载文件到spark:

scala> val df = spark.read.format("csv").option("inferSchema", "true").option("header", "false").load("/home/xxx/tmp/people.csv");

接着运行如下查询:

scala> df.groupBy("_c4").count.orderBy(desc("count")).show(100,false)
+------------------------------------+-----+
|_c4                                 |count|
+------------------------------------+-----+
|Patrol Officer                      |122  |
|Market Research Analyst             |122  |
|Administrative Assistant            |120  |
|Pharmacy Technician                 |119  |
|Paramedic                           |117  |
|Personal Care Aide                  |116  |
|Sales Manager                       |115  |
|Restaurant Cook                     |114  |
|Financial Manager                   |113  |
|Mechanical Engineer                 |113  |
|Loan Officer                        |112  |
|High School Teacher                 |112  |
|Clinical Social Worker              |112  |
|Substance Abuse Counselor           |112  |
|Speech-Language Pathologist         |112  |
|Epidemiologist                      |111  |
|Cost Estimator                      |111  |
|Financial Analyst                   |111  |
|Medical Secretary                   |110  |
|Clinical Laboratory Technician      |110  |
|Dentist                             |110  |
|Medical Assistant                   |109  |
|Insurance Agent                     |109  |
|Recreation & Fitness Worker         |109  |
|Public Relations Specialist         |108  |
|Marriage & Family Therapist         |108  |
|Architect                           |107  |
|Database Administrator              |107  |
|Pharmacist                          |107  |
|Glazier                             |106  |
|Computer Systems Analyst            |105  |
|Diagnostic Medical Sonographer      |105  |
|Auto Mechanic                       |105  |
|Compliance Officer                  |105  |
|Hairdresser                         |104  |
|Logistician                         |104  |
|Telemarketer                        |104  |
|Carpenter                           |103  |
|Meeting Convention & Event Planner  |103  |
|Physical Therapist                  |103  |
|Security Guard                      |102  |
|Bus Driver                          |102  |
|Dental Assistant                    |102  |
|Sports Coach                        |102  |
|Sales Representative                |102  |
|School Psychologist                 |101  |
|Dental Hygienist                    |101  |
|Elementary School Teacher           |101  |
|Paralegal                           |101  |
|Preschool Teacher                   |101  |
|Painter                             |100  |
|Veterinarian                        |100  |
|Fabricator                          |100  |
|Maintenance & Repair Worker         |100  |
|Landscaper & Groundskeeper          |100  |
|Marketing Manager                   |99   |
|School Counselor                    |99   |
|Executive Assistant                 |99   |
|Construction Worker                 |99   |
|Home Health Aide                    |99   |
|Middle School Teacher               |99   |
|Brickmason & Blockmason             |98   |
|Physical Therapist Assistant        |98   |
|Software Developer                  |98   |
|Customer Service Representative     |97   |
|Art Director                        |97   |
|Construction Manager                |97   |
|Massage Therapist                   |96   |
|Veterinary Technologist & Technician|96   |
|Respiratory Therapist               |96   |
|Radiologic Technologist             |95   |
|Registered Nurse                    |95   |
|Civil Engineer                      |94   |
|Real Estate Agent                   |93   |
|Accountant                          |93   |
|Janitor                             |92   |
|Financial Advisor                   |92   |
|Delivery Truck Driver               |92   |
|Occupational Therapist              |92   |
|Electrician                         |91   |
|Management Analyst                  |90   |
|Nursing Aide                        |90   |
|Child & Family Social Worker        |90   |
|Business Operations Manager         |89   |
|Teacher Assistant                   |89   |
|Web Developer                       |87   |
|Cement Mason & Concrete Finisher    |86   |
|Mental Health Counselor             |86   |
|Maid & Housekeeper                  |85   |
|Physician                           |84   |
|Cashier                             |84   |
|IT Manager                          |83   |
|HR Specialist                       |83   |
|Interpreter & Translator            |82   |
|Lawyer                              |82   |
|Computer Programmer                 |81   |
|Computer Systems Administrator      |81   |
|Receptionist                        |79   |
|Plumber                             |77   |
|Bookkeeping Accounting & Audit Clerk|76   |
+------------------------------------+-----+

可以看到查询结果的输出。可惜spark没有输出查询时间,只能靠感觉。性能跟drill差不多,也是很快的。

综上所述,在这个大内存的年代,普遍都是在内存里运行查询引擎。这类引擎太多了,有spark, flink, ignite, impala, drill等等。它们在性能上要远优于过去依赖于分布式文件系统的mr查询。哪怕是做离线数仓,Hive也并非最佳选择,因为它实在太慢了。

Source URL: https://smart.postno.de/archives/3715