Hive Statements Worth Memorizing

The Ultimate Showdown Between pandas and SQL (Part 4)
The Ultimate Showdown Between pandas and SQL (Part 6)

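Log tables and offline tables can be inaccurate, and that is normal (system switchovers, shutdowns, or restarts can cause anomalies), whereas t_tznew_date_user_uv is a cache table and is accurate.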

Ranking

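【ROW_NUMBER()】Numbers the rows consecutively in sort order; tied rows still get different numbers. Example: 1234
【DENSE_RANK()】Tied rows share a rank and no rank is skipped afterwards, so the highest rank is smaller than the row count. Example: 1223
【RANK()】Tied rows share a rank and the following ranks are skipped, so the highest rank still matches the row count. Example: 1224
【Excel RANK function: can overpay】Gives the same result as RANK(), for example 1224. Tied rows all receive the better rank, so a rank-based payout can pay out too much.
【Excel SUMPRODUCT counting formula: can underpay】A formula that counts the values greater than or equal to the current one gives tied rows the worse rank, for example 1334. Hive's RANK() does not produce this result.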
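To compare the three functions side by side, here is a minimal sketch on four inline rows. The names and scores are made up for illustration, and the query assumes a Hive version that allows SELECT without FROM (0.13 or later). Which of the two tied rows receives row_num 2 and which receives 3 is not guaranteed.

```sql
-- Four hypothetical rows; 'b' and 'c' are tied on score.
SELECT name,
       score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,    -- 1, 2, 3, 4
       RANK()       OVER (ORDER BY score DESC) AS rnk,        -- 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk   -- 1, 2, 2, 3
FROM (
    SELECT 'a' AS name, 95 AS score
    UNION ALL
    SELECT 'b' AS name, 90 AS score
    UNION ALL
    SELECT 'c' AS name, 90 AS score
    UNION ALL
    SELECT 'd' AS name, 80 AS score
) t;
```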


Strings

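【1】Both are substitution functions. Hive has translate; replace only exists in newer Hive versions (1.3.0 / 2.1.0 and later), so on older versions use translate or regexp_replace instead.
【2】replace works on whole substrings, while translate maps individual characters one by one.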

Switching between date representations by substitution (mapping '-' to an empty string removes the dashes):
SELECT translate(CAST(stat_date as STRING),'-','') as dt
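The difference between character-level and substring-level substitution can be sketched as follows. regexp_replace stands in for the substring-level function here because it is available in all Hive versions; the input strings are made up for illustration.

```sql
SELECT translate('2020-11-15', '-', '')      AS dashes_removed,      -- 20201115
       -- translate maps each character: a -> x, b -> y, wherever they occur
       translate('abcba', 'ab', 'xy')        AS char_by_char,        -- xycyx
       -- regexp_replace only replaces the contiguous substring 'ab'
       regexp_replace('abcba', 'ab', 'xy')   AS substring_replaced;  -- xycba
```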

Dates

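How 10-digit and 13-digit timestamps arise:
A 10-digit timestamp is accurate to the second;
a 13-digit timestamp is accurate to the millisecond, so the two differ by a factor of 1000.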
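Because from_unixtime expects seconds, a 13-digit value has to be divided by 1000 first. A minimal sketch, using a made-up millisecond value that corresponds to the 10-digit example used elsewhere in these notes:

```sql
-- Both columns format the same instant; the exact date shown depends on the session time zone.
SELECT from_unixtime(1607434525, 'yyyy-MM-dd')                           AS from_seconds,
       -- "/" returns a double in Hive, so cast back to BIGINT before formatting
       from_unixtime(CAST(1607434525000 / 1000 AS BIGINT), 'yyyy-MM-dd') AS from_millis;
```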

Today:
SELECT CURRENT_DATE

The same day last week:
select date_sub(CURRENT_DATE,7)

Convert the string 20201115 to 2020-11-15, turn the string into a date with to_date(), and get the week number with weekofyear():
【1：concat_ws】weekofyear(to_date(concat_ws('-',substr(dt,1,4),substr(dt,5,2),substr(dt,7,2))))
【2：concat】concat('19',substr(idcard,7,2),
                            '-', substr(idcard,9,2),
                            '-', substr(idcard,11,2)
                            ) 

Date >>>> timestamp
select unix_timestamp()   --1565858389
Date >>>> timestamp 【the timestamp does not need single quotes】
select 1607434525,'1607434525'
,unix_timestamp(),from_unixtime(unix_timestamp(),'yyyy-MM-dd'),from_unixtime(1607434525,'yyyy-MM-dd')

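Convert a date from the 20190410 format to the 2019-04-10 format 【first convert the value, e.g. ord_dt, to a Unix timestamp, then format the timestamp seconds as the target date string; use MM for the month, because lowercase mm means minutes】
select from_unixtime(unix_timestamp('20190410','yyyyMMdd'),'yyyy-MM-dd');
select from_unixtime(unix_timestamp('2019-04-10','yyyy-MM-dd'),'yyyyMMdd');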

Timestamp >>>> date
select from_unixtime(unix_timestamp(),'yyyy-MM-dd'),from_unixtime(unix_timestamp(),'yyyyMMdd')

Get the next Monday after a given day:
select next_day('2019-12-12','MO');

Get the Monday of the given day's week:
select date_add(next_day('2019-12-12','MO'),-7);
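The same trick gives the other end of the week. This is a small sketch that assumes weeks run Monday to Sunday; 2019-12-12 is a Thursday.

```sql
-- next_day returns the first Monday strictly after the given date (2019-12-16 here),
-- so stepping back 7 days or 1 day lands on the Monday or Sunday of the given date's week.
SELECT date_add(next_day('2019-12-12', 'MO'), -7) AS week_monday,  -- 2019-12-09
       date_add(next_day('2019-12-12', 'MO'), -1) AS week_sunday;  -- 2019-12-15
```

Replacing the literal with CURRENT_DATE gives the bounds of the current week.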

Miscellaneous

UNION ALL, which does not deduplicate:
select a.id,a.name from a
union all
select b.sid,b.sname from b

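UNION ALL, which does not deduplicate 【the form below is correct; otherwise write union all (select ...) aa】:
Pitfall 1:
            select dt,uid,success_uv from hdp_58_ubu_sjmobile_defaultdb.month10
            group by dt,uid,success_uv
            union ALL
            SELECT  dt,uid,(uv + new_uv) AS success_uv
            FROM  hdp_ubu_tech_wei_defaultdb.t_tznew_date_user_uv
            where dt>='20201101' and dt<='20201116'  -- my mistake was here: it must be dt>='20201101', not dt>='20201001'
            group by dt,uid,(uv + new_uv)
Pitfall 2: a mismatch in the type of success_uv (when one branch is int and the other is string, the union forces the column to string, and a later comparison with an int fails!)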
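One way to sidestep the type pitfall is to cast the column to the same type in both branches. The sketch below uses inline rows with made-up values in place of the real tables; the choice of BIGINT is an assumption, and any common numeric type would do.

```sql
-- Branch 1 holds success_uv as a string, branch 2 computes it from ints;
-- casting both to BIGINT keeps the unioned column numeric.
SELECT dt, uid, CAST(success_uv AS BIGINT) AS success_uv
FROM (SELECT '20201101' AS dt, 'u1' AS uid, '12' AS success_uv) a
UNION ALL
SELECT dt, uid, CAST(uv + new_uv AS BIGINT) AS success_uv
FROM (SELECT '20201102' AS dt, 'u2' AS uid, 3 AS uv, 4 AS new_uv) b;
```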


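Display as a percentage 【multiply by 100, round to two decimal places, append %; rounding after the multiplication keeps floating-point noise such as 12.340000000000002 out of the result】
concat(cast(round(x / y * 100, 2) as string), '%')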

Two ways to avoid duplicate values:
【1】select  distinct user_id
【2】select  user_id...group by user_id

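【Using case when】Derive sex from an ID card number: the 17th digit of an 18-digit number, or the last digit of a 15-digit number, is even for female and odd for male.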
case when length(idcard) = 18 then
            case when substr(idcard,17,1)%2 = 0 then 'F' 
                 when substr(idcard,17,1)%2 <> 0 then 'M'
                 else null end
       when length(idcard) = 15 then 
            case when substr(idcard,15)%2 = 0 then 'F' 
                 when substr(idcard,15)%2 <> 0 then 'M'
                 else null end 
       else null end  as sex

【Hive versions before 2.2.0 do not support non-equi conditions in the ON clause and fail with: Both left and right aliases encountered in JOIN 'pipei'. A locate() filter in the WHERE clause achieves the same effect】
select * from aa
    left join 
    hdp_58_ubu_sjmobile_defaultdb.ceshi cc
    on 1=1
    where concat(bb.city1_name,bb.city2_name,bb.city3_name) >= cc.pipei)dd
------- this evaluates to true ------ concat(bb.city1_name,bb.city2_name,bb.city3_name) >= cc.pipei

    left join 
    hdp_58_ubu_sjmobile_defaultdb.ceshi cc
    ON bb.city1_name=cc.province  
    -- on (true) is not used here, because it requires set hive.mapred.mode=nonstrict, which Yunchuang (云窗) apparently does not support?
    where locate(cc.pipei,concat(bb.city1_name,bb.city2_name,bb.city3_name))>0
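The locate() workaround can be tried in isolation with the sketch below. The address and keyword values are made up, and the aliases a and k are hypothetical. Note that a cross join is a Cartesian product, so it may be rejected when hive.mapred.mode is strict, which is why the notes above join on an equality column first.

```sql
-- Keep only the keyword rows whose pipei text occurs inside the address.
SELECT a.addr, k.pipei
FROM (SELECT 'Guangdong Shenzhen Nanshan' AS addr) a
CROSS JOIN (
    SELECT 'Shenzhen' AS pipei
    UNION ALL
    SELECT 'Beijing' AS pipei
) k
WHERE locate(k.pipei, a.addr) > 0;   -- locate returns 0 when the substring is absent
```

Only the 'Shenzhen' row survives the filter, since 'Beijing' does not occur in the address.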

Non-equi joins in Hive
 JouyPub: an important blog
 Summary of Hive statements commonly used at work
