Hive homework notes

# Remove the column names (i.e. the first line, the header row)
sed '1d' orders.csv

# Inspect the available fields (dimensions)
$ desc orders;
Output:
order_id                string              
user_id                 string              
eval_set                string              
order_number            string              
order_dow               string              
order_hour_of_day       string              
days_since_prior_order  string   

create table orders 
(
order_id string,
user_id string,
dow string
) row format delimited fields terminated by ',' 
lines terminated by '\n';

2. How many orders does each user place? [orders] (PV page views > UV unique visitors)

select user_id, count(order_id) as order_cnt
from orders
group by user_id
order by order_cnt desc
limit 100;

Output:
202272  100
133983  100
172806  100
69995   100
162455  100
67798   100
82420   100
35190   100
67736   100
117890  100

3. For each user, the average of [1. how many products each order has] (avg)

Analysis:
$ desc orders 
order_id                string              
user_id                 string              
eval_set                string              
order_number            string              
order_dow               string              
order_hour_of_day       string              
days_since_prior_order  string  

$ desc order_products_prior;
order_id                string
product_id              string        -- the products within an order (count per order_id)
add_to_cart_order       string
reordered               string

$ select * from order_products_prior limit 10;
order_id        product_id      add_to_cart_order       reordered
5       18569   19      1
5       41176   20      1
5       48366   21      1
5       47209   22      0
5       46522   23      0
5       38693   24      0
5       48825   25      0
5       8479    26      0

Hive SQL:
select user_id, avg(prod_cnt) as avg_prod_cnt
from (
select orders.user_id, prod.prod_cnt from orders
join (
select order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id) prod
on orders.order_id=prod.order_id
) t
group by user_id
order by avg_prod_cnt desc
limit 100;

Kill subtasks:
① kill job: hadoop job -kill job_1553939296911_0006
② kill application:
yarn application -kill application_1537547243219_0056
  4. Distribution of each user's orders across the days of the week --pivot rows to columns
    if statement
    case when statement
  5. For each user, the average number of products bought per purchase day
    treat the days_since_prior_order field as the purchase day

-- normalize empty days_since_prior_order values to '0'
select
if(days_since_prior_order='','0',days_since_prior_order)
as days_since_prior_order

select user_id, count(distinct days_since_prior_order) as day_cnt
from orders
group by user_id

How many products does each order contain:
select order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id

  6. What are each user's 3 favorite products
    ① join the tables
    select user_id, product_id from orders join order_products_prior pri on orders.order_id=pri.order_id limit 100;
select user_id, concat_ws(',', collect_list(order_id)) as order_ids from orders group by user_id limit 100;
1       2539329,2398795,473747,2254736,431534,3367565,550135,3108588,2295261,2550362,1187899
10      1224907,68288,2115522,83395,1353310,1822501
100     680467,3159209,2443738,2337051,2875733,3302990
1000    3045856,1456368,2700013,2003258,568709,18216,2885089,903956
10000   1492646,1444722,3272690,2152976,859342,103998,3146859,2306339,1765454,377814,1090401,3204347,2077534,1731354,1191621,428409,936532,2300676,236049,715590,2196356,192826,2652468,2600064,12156,1550736,793788,1824260,1044054,2505014,1104652,2646743,2166278,247668,545893,1902338,2327642,2610522,3038094,2856130,675149,1762088,2662079,775362,956424,3007165,858276,957745,664258,2492781,686954,3128885,1946074,611183,3079937,193643,1788987,2864915,842149,1404909,2477429,2999447,2511241,3060036,3268994,584896,343340,3306425,1125006,951827,1990815,2989046,1925217
100000  507791,284691,1016379,1666305,2380440,2159051,1844720,2433517,2104152,576286



select user_id, collect_list(order_id) as order_ids from orders group by user_id limit 100;
1       ["2539329","2398795","473747","2254736","431534","3367565","550135","3108588","2295261","2550362","1187899"]
10      ["1224907","68288","2115522","83395","1353310","1822501"]
100     ["680467","3159209","2443738","2337051","2875733","3302990"]
1000    ["3045856","1456368","2700013","2003258","568709","18216","2885089","903956"]
10000   ["1492646","1444722","3272690","2152976","859342","103998","3146859","2306339","1765454","377814","1090401","3204347","2077534","1731354","1191621","428409","936532","2300676","236049","715590","2196356","192826","2652468","2600064","12156","1550736","793788","1824260","1044054","2505014","1104652","2646743","2166278","247668","545893","1902338","2327642","2610522","3038094","2856130","675149","1762088","2662079","775362","956424","3007165","858276","957745","664258","2492781","686954","3128885","1946074","611183","3079937","193643","1788987","2864915","842149","1404909","2477429","2999447","2511241","3060036","3268994","584896","343340","3306425","1125006","951827","1990815","2989046","1925217"]
100000  ["507791","284691","1016379","1666305","2380440","2159051","1844720","2433517","2104152","576286"]



My questions:

  1. How can data in Hive be imported into Redis?

Answer:

  1. How can data in Hive be imported into Redis?

To be filled in.
https://www.fuwuqizhijia.com/redis/201704/67005.html
https://blog.csdn.net/DSLZTX/article/details/50775500
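
While the links above cover the details, here is a minimal sketch of one common approach, assuming the Hive result is first exported to a local delimited file and the redis-py package is available (the file path, export query, and key naming below are illustrative):

import redis  # assumes the redis-py package is installed

# Assumed export step run in Hive beforehand, e.g.:
#   INSERT OVERWRITE LOCAL DIRECTORY '/tmp/user_prods'
#   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
#   SELECT user_id, top_3_prods FROM ...;
r = redis.StrictRedis(host='localhost', port=6379, db=0)

pipe = r.pipeline()  # batch the writes instead of one round trip per row
with open('/tmp/user_prods/000000_0') as f:  # hypothetical exported file
    for line in f:
        user_id, top_prods = line.rstrip('\n').split('\t')
        pipe.set('user:%s:top_prods' % user_id, top_prods)
pipe.execute()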

use test;
drop table if exists student2;
create table student2(
id int,
name string,
age int,
course array<string>,
body map<string,float>,
address struct<street:string,number:int>
)
partitioned by (state string,city string)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
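
For reference, a hypothetical row for student2, serialized with the delimiters declared above: '\t' between fields, ',' between collection items, ':' between map keys and values (struct fields also use the collection delimiter). A small Python sketch builds such a line; all values are made up:

# Build one text-file row matching the student2 DDL's delimiters.
fields = [
    "1",                                # id int
    "zhangsan",                         # name string
    "20",                               # age int
    ",".join(["math", "hive"]),         # course array<string>
    ",".join("%s:%s" % kv for kv in [("height", 175.0), ("weight", 60.5)]),  # body map<string,float>
    ",".join(["renmin_road", "42"]),    # address struct: street,number
]
print("\t".join(fields))
# 1	zhangsan	20	math,hive	height:175.0,weight:60.5	renmin_road,42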

--1. Create the orders and order_products_prior tables in Hive
--sed '1d' orders.csv removes the header row
create table orders
(
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string
)
row format delimited fields terminated by ',' --'\t'
lines terminated by '\n';
--location '/data/orders' --(orders is a directory)

--load the data
load data local inpath '/home/badou/Documents/data/order_data/orders.csv'
overwrite into table orders;

-- order_products_prior columns:
order_id string
product_id string
add_to_cart_order string
reordered string

--2. How many orders does each user place? [orders] (PV page views > UV unique visitors)
--order_id,user_id
select user_id,count(order_id) as order_cnt
from orders
group by user_id
order by order_cnt desc
limit 10

--3. For each user, the average of [how many products each order has] (avg)
orders[user, order]   order_products_prior[order, product]   order_id product_id
--3.1. How many products per order
select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id

--3.2
set hive.cli.print.header=true;
select orders.user_id,prod.prod_cnt
from orders
join(
select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id)prod
on orders.order_id=prod.order_id
limit 10;

Three receipts:

  1. 10 products
  2. 20 products
  3. 30 products

(10+20+30)/3 avg

user_id avg([10,20,30])=20

user 1: 100 orders, 1 product each -> sum 1*100 products, avg 1
user 2: 2 orders, 10 and 20 products -> sum 10+20, avg 15

select user_id, avg(prod_cnt) as avg_prod_cnt
from
(
select orders.user_id,prod.prod_cnt
from orders
join(
select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id)prod
on orders.order_id=prod.order_id
)t
group by user_id
order by avg_prod_cnt desc
limit 100

Both of these kill subtasks:
-- kill job:
hadoop job -kill job_1537547243219_0058
-- kill application:
yarn application -kill application_1537547243219_0056

select orders.user_id,avg(prod.prod_cnt) as avg_prod_cnt
from orders
join(
select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id)prod
on orders.order_id=prod.order_id
group by orders.user_id
limit 10;

--4. Distribution of each user's orders across the days of the week --pivot rows to columns
set hive.cli.print.header=true;
select
user_id,
sum(case order_dow when '0' then 1 else 0 end) as dow_0,
sum(case order_dow when '1' then 1 else 0 end) as dow_1,
sum(case order_dow when '2' then 1 else 0 end) as dow_2,
sum(case order_dow when '3' then 1 else 0 end) as dow_3,
sum(case order_dow when '4' then 1 else 0 end) as dow_4,
sum(case order_dow when '5' then 1 else 0 end) as dow_5,
sum(case order_dow when '6' then 1 else 0 end) as dow_6
from orders
group by user_id
limit 20;

--5. For each user, the average number of products bought per purchase day
--[treat days_since_prior_order as a date, e.g. 20181014]
--1. orders: user_id, days_since_prior_order (the day)
select if(days_since_prior_order='','0',days_since_prior_order) as days_since_prior_order

select user_id,count(distinct days_since_prior_order) as day_cnt
from orders
group by user_id

--2. How many products per order
select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id

1. user_id, prod_cnt, day_cnt: join orders with order_products_prior
2. user_id, prod_cnt, dt (days_since_prior_order): join

--1.
select orders.user_id,sum(prod.prod_cnt)/count(distinct days_since_prior_order) as day_cnt
from orders
join
(select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id)prod
on orders.order_id=prod.order_id
group by orders.user_id
limit 100

--2.
select user_id,avg(prod_cnt) as avg_day_prod_cnt
from(
select orders.user_id,orders.days_since_prior_order,sum(prod.prod_cnt) as prod_cnt
from orders
join
(select order_id,count(product_id) as prod_cnt
from order_products_prior
group by order_id)prod
on orders.order_id=prod.order_id
group by orders.user_id,orders.days_since_prior_order
)t
group by user_id
limit 100

Both queries produce the same result and follow the same execution logic: the total product count divided by the number of distinct purchase days equals the average of the per-day product sums.

  6. What are each user's three favorite products? The final table can have 3 columns, or a single string
    user_id, product_id, prod_buy_cnt -- how many times one user bought the same product

select user_id,collect_list(concat_ws('_',product_id,cast(row_num as string))) as top_3_prods
from(
select user_id,product_id,
row_number() over(partition by user_id order by prod_buy_cnt desc) as row_num,
prod_buy_cnt
from(
select orders.user_id,pri.product_id,count(1) as prod_buy_cnt
from orders
join order_products_prior pri
on orders.order_id=pri.order_id
group by orders.user_id,pri.product_id
)t
)tt
where row_num<=3
group by user_id
limit 10;

--concat_ws
select user_id,collect_list(order_id) as order_ids
from orders
group by user_id
limit 10;

orders
app, m, pc
orders_app
orders_m
orders_pc
group by == word count
1. combiner
2. set hive.groupby.skewindata=true;
splits one MapReduce job into two MR jobs
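
As a rough illustration (not Hive's actual implementation), the two-job idea can be simulated in Python: the first stage scatters rows randomly across reducers and pre-aggregates, so the hot key is spread out; the second stage merges the partial counts by the real key:

import random
from collections import Counter

words = ["the"] * 1000 + ["hive", "skew", "join"] * 10  # "the" is the skewed key

# MR 1: distribute rows randomly (not by key) and partially aggregate,
# so no single reducer receives all 1000 copies of the hot key.
num_reducers = 4
partials = [Counter() for _ in range(num_reducers)]
for w in words:
    partials[random.randrange(num_reducers)][w] += 1

# MR 2: group the partial results by the real key and merge the counts.
final = Counter()
for p in partials:
    final.update(p)
print(final.most_common(3))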

--multiple MapReduce jobs
a.order_id = b.ord_id
c.product_id = b.prod_id

user_id product_desc

--one MapReduce job
a.order_id = b.ord_id
a.order_id = c.ord_id
partition

a: the orders table -- user_id, order_id
b: the order-to-products table -- order_id, products
c: the order-type table -- order_id, type; which ad channels the orders came from
loading page

--one MR job
select * from orders
join order_products_prior pri on orders.order_id=pri.order_id
join tmp_order tord on orders.order_id=tord.order_id

--

select * from orders
join order_products_prior pri on orders.order_id=pri.order_id
join tmp_order tord on pri.order_id=tord.order_id

# Map-side join: the small table (order_id -> product_id) is held in memory
# as a dict, and each streamed line of the orders table is joined against it.
import sys

order_to_product = {"order_id": "product_id"}  # placeholder; in practice loaded from the small table

# each input line: user_id, order_id; output: user_id, order_id, product_id
for line in sys.stdin:
    s = line.split()
    user_id = s[0]
    order_id = s[1]
    product_id = order_to_product.get(order_id)
    print('%s\t%s\t%s' % (user_id, order_id, product_id))

a b
1,
2,
3
Hue, CDH

  1. [4,5,[7,8,9,[11,21,23]],6] -- flatten the nested list (see the sketch below)
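
Assuming this exercise asks to flatten the nested list, a recursive Python sketch:

def flatten(xs):
    # Recursively flatten an arbitrarily nested list.
    out = []
    for x in xs:
        if isinstance(x, list):
            out.extend(flatten(x))  # recurse into nested lists
        else:
            out.append(x)
    return out

print(flatten([4, 5, [7, 8, 9, [11, 21, 23]], 6]))
# [4, 5, 7, 8, 9, 11, 21, 23, 6]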

val s = "The figure of Irene, never, as the reader may possibly have observed, present, except through the senses of other characters, is a concretion of disturbing Beauty impinging on a possessive world."
s.split(" ")
s.split(" ").length

import scala.io.Source
val lines = Source.fromFile("./The_man_of_property.txt").getLines().toList
lines: List[String]

lines.map(x=>x.split(" "))
res0: List[Array[String]]

lines.map(x=>x.split(" ")).flatten
res2: List[String]
<=>
lines.flatMap(x=>x.split(" "))
res2: List[String]

The MR map step:

lines.flatMap(_.split(" ")).map(x=>(x,1))
lines.flatMap(_.split(" ")).map((_,1))

lines.flatMap(_.split(" ")).map(x=>(x,1)).groupBy(x=>x._1)
lines.flatMap(_.split(" ")).map(x=>(x,1)).groupBy(_._1)
scala.collection.immutable.Map[String,List[(String, Int)]]
key->value
forgotten -> List((forgotten,1), (forgotten,1), (forgotten,1), (forgotten,1), (forgotten,1), (forgotten,1)

MR word count:
lines.flatMap(_.split(" ")).map(x=>(x,1)).groupBy(_._1).map(x=>(x._1,x._2.length))

lines.flatMap(_.split(" "))
.map(x=>(x,1))
.groupBy(_._1)
.map(x=>(x._1,x._2.map(_._2).sum))
--x._2 is a List[(String,Int)]
--List[(Int, Int, Int)] = List((1,2,3), (4,5,6), (7,8,9))
--b.map(_._2).sum

--reduce
lines.flatMap(_.split(" ")).map(x=>(x,1)).groupBy(_._1).map(x=>(x._1,x._2.map(_._2).reduce(_+_)))

sum = reduce(_+_)
reduce()
List(1,1,1) => ((1+1)+1)
sum += x
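
For intuition, the same pairwise fold in Python (functools.reduce behaves like Scala's reduce here):

from functools import reduce

# reduce folds the list pairwise: ((1+1)+1) == 3, which is why
# .reduce(_+_) can replace .sum in the Scala snippets above.
print(reduce(lambda a, b: a + b, [1, 1, 1]))  # 3
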
topN

--sort: take the top 10 by word frequency
lines.flatMap(_.split(" ")).map(x=>(x,1)).groupBy(_._1)
.map(x=>(x._1,x._2.size))
.toList
.sortBy(_._2)
.reverse
.slice(0,10)

lines.flatMap(_.split(" "))
.map(x=>(x,1)).groupBy(_._1)
.map(x=>(x._1,x._2.size))
.toList.sortWith(_._2>_._2)

lines.flatMap(_.split(" ")).map((_,1))
.groupBy(_._1)
.mapValues(_.size).toArray
.sortWith(_._2>_._2)
.slice(0,10)

rdd DataFrame
