1. Why replace count(distinct) with group by
Counting the distinct values of a column with count(distinct) can be very slow on a large table: with no group-by key, Hive funnels all of the map output through a single reducer, which does the deduplication alone. The query can be rewritten with group by so the deduplication work is spread across many reducers:
-- original sql
select count(distinct age)
from demo;
-- rewritten
select count(1)
from (select age
from demo
group by age) tmp;
2. Comparison
2.1 Distinct
select count(distinct age) from demo;
2.2 Group By
select count(1)
from (select age
from demo
group by age) tmp;
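The two forms are semantically equivalent. A quick sanity check with Python's built-in sqlite3 module (standing in for Hive only to compare results, not execution behavior), against an assumed one-column `demo` table:

```python
import sqlite3

# In-memory table standing in for the Hive table `demo` (assumed schema: one int column `age`).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (age INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(20,), (21,), (20,), (22,), (21,)])

# Original form: count(distinct age)
(distinct_cnt,) = conn.execute("SELECT count(DISTINCT age) FROM demo").fetchone()

# Rewritten form: count over a deduplicating subquery
(group_by_cnt,) = conn.execute(
    "SELECT count(1) FROM (SELECT age FROM demo GROUP BY age) tmp"
).fetchone()

print(distinct_cnt, group_by_cnt)  # 3 3 -- same answer from both forms
```

One caveat: if `age` contains NULLs, count(DISTINCT age) ignores them, while the group-by subquery produces a NULL group that count(1) does count; adding `where age is not null` to the subquery makes the two match exactly.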
3. Differences in the Hive execution plans
3.1 Distinct
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage // root stage
Stage-0 depends on stages: Stage-1 // Stage-0 runs after Stage-1 finishes
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan // table scan
alias: demo // table name
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE // table statistics: row count and data size
Select Operator // column projection
expressions: age (type: int) // projected column and its type
outputColumnNames: age // output column name
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Group By Operator // grouping/aggregation operator
aggregations: count(DISTINCT age) // aggregate function
keys: age (type: int) // grouping key
mode: hash // aggregation mode: hash = map-side hash aggregation (hash partition); partial = partial aggregation; final = final aggregation
outputColumnNames: _col0, _col1 // output column names after aggregation
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: + // empty = no sort; + = ascending; - = descending; +- = two sort keys, first ascending, second descending
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col0:0._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.131 seconds, Fetched: 45 row(s)
3.2 Group By
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: demo
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: age (type: int)
outputColumnNames: age
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: age (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 126172096 Data size: 504688384 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 63086048 Data size: 252344192 Basic stats: COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 63086048 Data size: 252344192 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.163 seconds, Fetched: 74 row(s)
The execution plans show that the group by rewrite launches two MR jobs, while the distinct form launches only one. The rewrite therefore pays off only when the data is large enough that the extra job-startup latency is far smaller than the computation time it saves. When the data volume is small, or the group key is heavily skewed, group by can end up slower than distinct.
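Why one plan bottlenecks on a single reducer while the other scales can be sketched with a toy shuffle simulation (the sample data, the two-reducer setup, and the hash partitioning below are illustrative assumptions, not Hive internals):

```python
from collections import defaultdict

rows = [20, 21, 20, 22, 21, 23, 20, 22]  # toy 'age' column

# --- distinct plan: no partition column, so every row shuffles to ONE reducer ---
single_reducer_input = list(rows)
distinct_count = len(set(single_reducer_input))  # that one reducer deduplicates everything

# --- group-by plan, job 1: rows are hash-partitioned on the group key ---
NUM_REDUCERS = 2
partitions = defaultdict(set)
for age in rows:
    partitions[hash(age) % NUM_REDUCERS].add(age)  # each reducer dedupes only its share

# job 2: each reducer emits its distinct-key count; a final reducer sums the counts
group_by_count = sum(len(keys) for keys in partitions.values())

print(distinct_count, group_by_count)  # 4 4 -- same answer, but the work is spread out
```

In the simulation both plans return 4, but in the first one a single worker holds every key, while in the second each worker holds only its hash partition, and the second job merely sums a handful of per-reducer counts, mirroring the tiny Stage-2 in the plan above.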