在过去Hadoop时代,Hive几乎是默认的Hadoop数据的查询引擎。当然Hive主要做离线查询使用,它是针对数据规模设计的,对交互式查询的能力没考虑那么多。
如果使用Hadoop默认的mr引擎,Hive实在太慢了。我这里用Hive和另一款类似的SQL引擎,Apache Drill做一个对比。
测试数据来自这里,是一个csv文件,包含10000条个人数据。共有5栏,分别是姓名、生日、性别、电话、工作。这是一份程序生成的fake数据。
Hive加载csv复杂一些,先要将数据导入到HDFS,然后再创建table,再load数据到table里。
将数据上传到本机部署的HDFS:
$ hdfs dfs -put people.csv /tmp/test/
登陆Hive shell,在Hive里创建一个table:
create table ppl (
name string,
born date,
sex string,
tel int,
job string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
上述创建了名为ppl的table,包含了5列:name, born, sex, tel, job。同时指定了列的类型。
因为csv是用逗号分隔的,所以指定一个属性:FIELDS TERMINATED BY ‘,’
接下来,使用如下语句,从HDFS加载csv文件到table:
LOAD DATA INPATH '/tmp/test/people.csv' OVERWRITE INTO TABLE ppl;
table数据有了,就可以运行sql查询了。我在beeline里使用的测试sql语句如下:
0: jdbc:hive2://localhost:10000/default> select job,count(*) as dd from ppl group by job order by dd desc;
+---------------------------------------+------+
| job | dd |
+---------------------------------------+------+
| Market Research Analyst | 122 |
| Patrol Officer | 122 |
| Administrative Assistant | 120 |
| Pharmacy Technician | 119 |
| Paramedic | 117 |
| Personal Care Aide | 116 |
| Sales Manager | 115 |
| Restaurant Cook | 114 |
| Financial Manager | 113 |
| Mechanical Engineer | 113 |
| Clinical Social Worker | 112 |
| Loan Officer | 112 |
| Substance Abuse Counselor | 112 |
| Speech-Language Pathologist | 112 |
| High School Teacher | 112 |
| Epidemiologist | 111 |
| Cost Estimator | 111 |
| Financial Analyst | 111 |
| Medical Secretary | 110 |
| Dentist | 110 |
| Clinical Laboratory Technician | 110 |
| Recreation & Fitness Worker | 109 |
| Insurance Agent | 109 |
| Medical Assistant | 109 |
| Public Relations Specialist | 108 |
| Marriage & Family Therapist | 108 |
| Database Administrator | 107 |
| Architect | 107 |
| Pharmacist | 107 |
| Glazier | 106 |
| Diagnostic Medical Sonographer | 105 |
| Compliance Officer | 105 |
| Computer Systems Analyst | 105 |
| Auto Mechanic | 105 |
| Telemarketer | 104 |
| Hairdresser | 104 |
| Logistician | 104 |
| Physical Therapist | 103 |
| Meeting Convention & Event Planner | 103 |
| Carpenter | 103 |
| Sports Coach | 102 |
| Security Guard | 102 |
| Bus Driver | 102 |
| Dental Assistant | 102 |
| Sales Representative | 102 |
| Dental Hygienist | 101 |
| School Psychologist | 101 |
| Preschool Teacher | 101 |
| Paralegal | 101 |
| Elementary School Teacher | 101 |
| Maintenance & Repair Worker | 100 |
| Fabricator | 100 |
| Veterinarian | 100 |
| Painter | 100 |
| Landscaper & Groundskeeper | 100 |
| Marketing Manager | 99 |
| Middle School Teacher | 99 |
| School Counselor | 99 |
| Home Health Aide | 99 |
| Executive Assistant | 99 |
| Construction Worker | 99 |
| Software Developer | 98 |
| Brickmason & Blockmason | 98 |
| Physical Therapist Assistant | 98 |
| Art Director | 97 |
| Construction Manager | 97 |
| Customer Service Representative | 97 |
| Respiratory Therapist | 96 |
| Veterinary Technologist & Technician | 96 |
| Massage Therapist | 96 |
| Radiologic Technologist | 95 |
| Registered Nurse | 95 |
| Civil Engineer | 94 |
| Accountant | 93 |
| Real Estate Agent | 93 |
| Janitor | 92 |
| Financial Advisor | 92 |
| Occupational Therapist | 92 |
| Delivery Truck Driver | 92 |
| Electrician | 91 |
| Child & Family Social Worker | 90 |
| Management Analyst | 90 |
| Nursing Aide | 90 |
| Teacher Assistant | 89 |
| Business Operations Manager | 89 |
| Web Developer | 87 |
| Cement Mason & Concrete Finisher | 86 |
| Mental Health Counselor | 86 |
| Maid & Housekeeper | 85 |
| Physician | 84 |
| Cashier | 84 |
| HR Specialist | 83 |
| IT Manager | 83 |
| Interpreter & Translator | 82 |
| Lawyer | 82 |
| Computer Systems Administrator | 81 |
| Computer Programmer | 81 |
| Receptionist | 79 |
| Plumber | 77 |
| Bookkeeping Accounting & Audit Clerk | 76 |
+---------------------------------------+------+
100 rows selected (2.66 seconds)
如上,可以看到清晰的查询结果,以及查询耗时:2.66秒。
再把数据导入到drill进行测试。这个十分简单,因为drill不需要定义schema,也就不需要创建表的过程。并且drill可以直接加载本地文件,无需依赖HDFS。
运行drill-embedded打开本机的一个drill实例,运行如下查询:
apache drill> select columns[4] as job, count(*) as dd from dfs.`/home/xxx/tmp/people.csv` group by job order by dd desc;
+--------------------------------------+-----+
| job | dd |
+--------------------------------------+-----+
| Market Research Analyst | 122 |
| Patrol Officer | 122 |
| Administrative Assistant | 120 |
| Pharmacy Technician | 119 |
| Paramedic | 117 |
| Personal Care Aide | 116 |
| Sales Manager | 115 |
| Restaurant Cook | 114 |
| Mechanical Engineer | 113 |
| Financial Manager | 113 |
| Substance Abuse Counselor | 112 |
| High School Teacher | 112 |
| Loan Officer | 112 |
| Clinical Social Worker | 112 |
| Speech-Language Pathologist | 112 |
| Cost Estimator | 111 |
| Financial Analyst | 111 |
| Epidemiologist | 111 |
| Clinical Laboratory Technician | 110 |
| Medical Secretary | 110 |
| Dentist | 110 |
| Medical Assistant | 109 |
| Recreation & Fitness Worker | 109 |
| Insurance Agent | 109 |
| Public Relations Specialist | 108 |
| Marriage & Family Therapist | 108 |
| Architect | 107 |
| Pharmacist | 107 |
| Database Administrator | 107 |
| Glazier | 106 |
| Auto Mechanic | 105 |
| Computer Systems Analyst | 105 |
| Compliance Officer | 105 |
| Diagnostic Medical Sonographer | 105 |
| Hairdresser | 104 |
| Telemarketer | 104 |
| Logistician | 104 |
| Meeting Convention & Event Planner | 103 |
| Physical Therapist | 103 |
| Carpenter | 103 |
| Bus Driver | 102 |
| Sales Representative | 102 |
| Dental Assistant | 102 |
| Security Guard | 102 |
| Sports Coach | 102 |
| Elementary School Teacher | 101 |
| Dental Hygienist | 101 |
| Preschool Teacher | 101 |
| Paralegal | 101 |
| School Psychologist | 101 |
| Landscaper & Groundskeeper | 100 |
| Painter | 100 |
| Maintenance & Repair Worker | 100 |
| Fabricator | 100 |
| Veterinarian | 100 |
| Middle School Teacher | 99 |
| School Counselor | 99 |
| Marketing Manager | 99 |
| Construction Worker | 99 |
| Home Health Aide | 99 |
| Executive Assistant | 99 |
| Physical Therapist Assistant | 98 |
| Brickmason & Blockmason | 98 |
| Software Developer | 98 |
| Art Director | 97 |
| Customer Service Representative | 97 |
| Construction Manager | 97 |
| Respiratory Therapist | 96 |
| Massage Therapist | 96 |
| Veterinary Technologist & Technician | 96 |
| Registered Nurse | 95 |
| Radiologic Technologist | 95 |
| Civil Engineer | 94 |
| Accountant | 93 |
| Real Estate Agent | 93 |
| Financial Advisor | 92 |
| Janitor | 92 |
| Delivery Truck Driver | 92 |
| Occupational Therapist | 92 |
| Electrician | 91 |
| Child & Family Social Worker | 90 |
| Nursing Aide | 90 |
| Management Analyst | 90 |
| Business Operations Manager | 89 |
| Teacher Assistant | 89 |
| Web Developer | 87 |
| Mental Health Counselor | 86 |
| Cement Mason & Concrete Finisher | 86 |
| Maid & Housekeeper | 85 |
| Physician | 84 |
| Cashier | 84 |
| HR Specialist | 83 |
| IT Manager | 83 |
| Interpreter & Translator | 82 |
| Lawyer | 82 |
| Computer Systems Administrator | 81 |
| Computer Programmer | 81 |
| Receptionist | 79 |
| Plumber | 77 |
| Bookkeeping Accounting & Audit Clerk | 76 |
+--------------------------------------+-----+
100 rows selected (0.204 seconds)
我们也可以看到清晰的查询结果,以及查询耗时:0.204秒。
这个简单对比,说明Hive实在太慢了。是Apache drill的十倍之慢。
与此同时,再对比下spark的查询性能。打开spark-shell,首先加载文件到spark:
scala> val df = spark.read.format("csv").option("inferSchema", "true").option("header", "false").load("/home/xxx/tmp/people.csv");
接着运行如下查询:
scala> df.groupBy("_c4").count.orderBy(desc("count")).show(100,false)
+------------------------------------+-----+
|_c4 |count|
+------------------------------------+-----+
|Patrol Officer |122 |
|Market Research Analyst |122 |
|Administrative Assistant |120 |
|Pharmacy Technician |119 |
|Paramedic |117 |
|Personal Care Aide |116 |
|Sales Manager |115 |
|Restaurant Cook |114 |
|Financial Manager |113 |
|Mechanical Engineer |113 |
|Loan Officer |112 |
|High School Teacher |112 |
|Clinical Social Worker |112 |
|Substance Abuse Counselor |112 |
|Speech-Language Pathologist |112 |
|Epidemiologist |111 |
|Cost Estimator |111 |
|Financial Analyst |111 |
|Medical Secretary |110 |
|Clinical Laboratory Technician |110 |
|Dentist |110 |
|Medical Assistant |109 |
|Insurance Agent |109 |
|Recreation & Fitness Worker |109 |
|Public Relations Specialist |108 |
|Marriage & Family Therapist |108 |
|Architect |107 |
|Database Administrator |107 |
|Pharmacist |107 |
|Glazier |106 |
|Computer Systems Analyst |105 |
|Diagnostic Medical Sonographer |105 |
|Auto Mechanic |105 |
|Compliance Officer |105 |
|Hairdresser |104 |
|Logistician |104 |
|Telemarketer |104 |
|Carpenter |103 |
|Meeting Convention & Event Planner |103 |
|Physical Therapist |103 |
|Security Guard |102 |
|Bus Driver |102 |
|Dental Assistant |102 |
|Sports Coach |102 |
|Sales Representative |102 |
|School Psychologist |101 |
|Dental Hygienist |101 |
|Elementary School Teacher |101 |
|Paralegal |101 |
|Preschool Teacher |101 |
|Painter |100 |
|Veterinarian |100 |
|Fabricator |100 |
|Maintenance & Repair Worker |100 |
|Landscaper & Groundskeeper |100 |
|Marketing Manager |99 |
|School Counselor |99 |
|Executive Assistant |99 |
|Construction Worker |99 |
|Home Health Aide |99 |
|Middle School Teacher |99 |
|Brickmason & Blockmason |98 |
|Physical Therapist Assistant |98 |
|Software Developer |98 |
|Customer Service Representative |97 |
|Art Director |97 |
|Construction Manager |97 |
|Massage Therapist |96 |
|Veterinary Technologist & Technician|96 |
|Respiratory Therapist |96 |
|Radiologic Technologist |95 |
|Registered Nurse |95 |
|Civil Engineer |94 |
|Real Estate Agent |93 |
|Accountant |93 |
|Janitor |92 |
|Financial Advisor |92 |
|Delivery Truck Driver |92 |
|Occupational Therapist |92 |
|Electrician |91 |
|Management Analyst |90 |
|Nursing Aide |90 |
|Child & Family Social Worker |90 |
|Business Operations Manager |89 |
|Teacher Assistant |89 |
|Web Developer |87 |
|Cement Mason & Concrete Finisher |86 |
|Mental Health Counselor |86 |
|Maid & Housekeeper |85 |
|Physician |84 |
|Cashier |84 |
|IT Manager |83 |
|HR Specialist |83 |
|Interpreter & Translator |82 |
|Lawyer |82 |
|Computer Programmer |81 |
|Computer Systems Administrator |81 |
|Receptionist |79 |
|Plumber |77 |
|Bookkeeping Accounting & Audit Clerk|76 |
+------------------------------------+-----+
可以看到查询结果的输出。可惜spark没有输出查询时间,只能靠感觉。性能跟drill差不多,也是很快的。
综上所述,在这个大内存的年代,普遍都是在内存里运行查询引擎。这类引擎太多了,有spark, flink, ignite, impala, drill等等。它们在性能上要远优于过去依赖于分布式文件系统的mr查询。哪怕是做离线数仓,Hive也并非最佳选择,因为它实在太慢了。