2021-02-02 HiveSQL优化技巧

本文只是针对数据分析师常用到数据查询时的调优技巧。

调优小技巧.png

技巧1:列裁剪和分区裁剪(只选择我们用到的数据)
（1）列裁剪：一般我们常见的数据宽表，可能有很多字段，几百个。select后只写需要查询的字段，用不到的字段不要写。因此不要用select * 。
（2）分区裁剪：只查询我们需要的分区内的数据。

技巧2:排序技巧（使用sort by+ distribute by代替order by ）
(1)order by是对排序字段的全局排序，会导致所有map端数据进入一个reduce中，数据量大时计算比较耗时。
(2)如果使用sort by ，会视情况启动多个reduce进行排序，提升了执行效率。
(3)一般情况下，sort by还要配合distribute by一同使用，这样map端的数据，会平均分配到到reduce中，否则，分配是随机的。
⚠️：(1)窗口函数中的order by 不能用sort by 代替。(2)适用于数据量特别大的时候（至少超过千万级的数据），数据量小的时候，优势不明显。
优化前：

select user_id,user_name
from user_info
order by user_id
limit 100;

优化后：

select user_id,user_name
from user_info
distribute by user_id
sort by user_id
limit 100;

技巧3:去重技巧(使用group by替换distinct)
尽量减少对单字段使用distinct去重：distinct会把计算放在一个服务器上；而group by会把计算分配到多个服务器并行操作，这里不会因为并行操作出现重复。
案例：某电商平台有一张用户交易记录表trade(大概有上亿条数据)，现在要查询2020年一整年的交易用户数。
优化前：

select count(distinct user_id)
from trade
where year(dt)=2020
group by user_id；

优化后：

select count(a.user_id) 
from
(select user_id
from trade
where year(dt)=2020
group by user_id) a;

技巧4:聚合技巧(MySQL同样适用)
(1)grouping sets
做用户画像时，我们会获取用户的很多属性（性别、学历、年龄区间、地域等等），然后进行统计。有多少个字段，就要写多少个sql。这种情况下，我们如果使用grouping sets可以只写一段代码就解决了。

select  sex,education,age,city,count(user_id)
from user_info
group by  sex,education,age,city
grouping sets( sex,education,age,city)；

结果.png

从输出的结果我们可以看出，grouping sets 的结果和union all执行的结果相同（空的用null填充），但是执行效率比union all高。
对哪些字段或者字段组合进行统计，都在grouping sets()的括号内进行设置，如(sex,(sex,city))。

select sex, city, count(user_id)
from user_info
group by sex, city
grouping sets(sex,(sex,city));

（2）cube（对group by 维度的所有组合进行聚合）

select sex, education,city, count(user_id)
from user_info
group by sex,education,city
with cube;

执行结果等同于：

select sex, education,city, count(user_id)
from user_info
group by sex,education,city
grouping sets(sex,education,city,(sex,education),(sex,city),(education,city),(sex,education,city))；

⚠️：跑完数据后的整理工作很关键。如果不是必须要所有组合或几乎所有组合的数据，不建议使用。
(3)roll_up
以最左侧的维度为主，进行层级聚合，是cube的子级。主要的场景，对于时间不同层级的聚合，如，同时计算出年份、季度、月度不同层级的支付金额。

select year(dt)  as year,
       quarter(dt) as quater,
       month(dt) as month,
       sum(pay_amount)
from trade
where dt>'0'
group by year(dt), quarter(dt),month(dt)
with rollup;

结果.jpeg

技巧5::union all时可以开启并发执行
开启参数设置：set hive.exec.parallel=true;（查看：set hive.exec.parallel; 关闭：set hive.exec.parallel=false;）。

技巧6:表的连接优化
(1)inner join时，小表在前，大表在后。(left join要考虑业务逻辑，不适用）;
(2)多个表（>=3）连接时，尽量使用相同的连接键（on后边的连接键，如果都使用相同的连接键，会只产生一个MapReduce job，否则会产生多个）;
(3)尽早的过滤数据（一般连接之前会对数据进行充分过滤，尽量减少每个阶段的数据量）。(4)有时候通过表的自连接进行查询，一般来说分区表数据量都比较大且增加快，因此通过两次查询然后通过表的自连接获取需要的数据，效率会比较受限，可以考虑其他方式。

技巧7:遵循严格模式
开启严格模式的参数设置：hive.mapred.mode=strict;(关闭：hive.mapred.mode=nostrict;查看：hive.mapred.mode;)
如果开启了严格模式(大部分公司都会选择开启)，执行以下sql会报错：
(1)查询分区表的时候，不限定分区列;
(2)量表产生了笛卡尔积的语句;
(3)order by进行排序没有指定limit。

2021-02-02 HiveSQL优化技巧

推荐阅读更多精彩内容