Requirements analysis
- Log format
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /course/view.php?id=27 HTTP/1.1" "303" "440" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /login/index.php HTTP/1.1" "303" "465" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"
Data cleaning
- Create the source table
In the log file, fields are separated by spaces, but some fields themselves also contain spaces, so the data cannot be loaded by simply splitting on spaces. Instead, a regular expression is used to match each field: each parenthesized group in the regex captures exactly one field.
create table IF NOT EXISTS source_log (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\"[^ ]*\") (\"[-|^ ]*\") (\"[^\"]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") ([-|^ ]*) (\"[^ ]*\") (\"[^\"]*\") (\"[-|^ ]*\") (\"[^ ]*\")"
)
STORED AS TEXTFILE;
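As a quick sanity check outside Hive, the same regular expression can be exercised against one of the sample lines with Python's `re` module; all 11 capture groups should come back, one per field:

```python
import re

# The same pattern as in input.regex, without the SQL-level \" escaping.
LOG_REGEX = re.compile(
    r'("[^ ]*") ("[-|^ ]*") ("[^"]*") ("[^"]*") ("[0-9]*") ("[0-9]*")'
    r' ([-|^ ]*) ("[^ ]*") ("[^"]*") ("[-|^ ]*") ("[^ ]*")'
)

# One of the sample lines from the log-format section above.
line = ('"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" '
        '"GET /course/view.php?id=27 HTTP/1.1" "303" "440" - '
        '"http://www.ibeifeng.com/user.php?act=mycourse" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" '
        '"-" "learn.ibeifeng.com"')

m = LOG_REGEX.match(line)
print(m.groups())  # 11 fields: '"27.38.5.159"', '"-"', ..., '"learn.ibeifeng.com"'
```

Note that Hive's RegexSerDe uses Java's regex engine, which for a pattern like this behaves the same as Python's; if any line fails to match, the SerDe fills the row with NULLs rather than raising an error.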
- Load the log data into the source table
Load:
load data local inpath '/opt/datas/moodle.ibeifeng.access.log' into table source_log;
Verify the load succeeded:
hive (log)> select time_local from source_log limit 5;
- Filter the fields
Per the requirements, only the user's IP address, the time, the request URL, and the referer (the page the visit came from) are needed.
The Parquet storage format is used here to save storage space.
Create a new table:
create table clean_log(
remote_addr string,
time_local string,
request string,
http_referer string
)
row format delimited fields terminated by '\t'
stored as parquet tblproperties("parquet.compression"="SNAPPY");
- Format the fields
- Strip the double quotes from each field
- Convert the date format
- Extract the request path
- Extract the main address from the referer
Custom UDFs implement the functions above and are loaded into Hive; the detailed steps are in these earlier articles of mine:
//www.greatytc.com/p/c8a8aa590e1e
//www.greatytc.com/p/ff0913045610
Function names after loading into Hive:
Strip double quotes: rm
Date conversion: df
Request truncation: ct
Referer truncation: rf
Register the custom functions:
hive (log)> create temporary function rm as "hiveUDF.RemoveQuote";
hive (log)> create temporary function df as "hiveUDF.DateTransform";
hive (log)> create temporary function rf as "hiveUDF.TruncationMainAdd";
hive (log)> create temporary function ct as "hiveUDF.TruncationRequestAdd";
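The Java implementations of these UDFs are covered in the articles linked above. As a rough Python illustration of what the four transformations are meant to do (the exact output formats of `df`, `ct`, and `rf` below are my assumptions, not necessarily what the real UDFs emit):

```python
from datetime import datetime
from urllib.parse import urlparse

def rm(s):
    """Strip the surrounding double quotes from a field."""
    return s.strip('"')

def df(s):
    """Parse the nginx time_local format; target format is assumed."""
    return datetime.strptime(s, '%d/%b/%Y:%H:%M:%S %z').strftime('%Y-%m-%d %H:%M:%S')

def ct(s):
    """Keep only the URL out of 'METHOD url PROTOCOL' (assumed behavior)."""
    parts = s.split(' ')
    return parts[1] if len(parts) >= 2 else s

def rf(s):
    """Keep only scheme://host of the referer (assumed behavior)."""
    u = urlparse(s)
    return f'{u.scheme}://{u.netloc}'

print(rm('"27.38.5.159"'))                               # 27.38.5.159
print(df('31/Aug/2015:00:04:37 +0800'))                  # 2015-08-31 00:04:37
print(ct('GET /course/view.php?id=27 HTTP/1.1'))         # /course/view.php?id=27
print(rf('http://www.ibeifeng.com/user.php?act=mycourse'))  # http://www.ibeifeng.com
```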
Load the formatted data into the new table:
insert overwrite table clean_log
select rm(remote_addr), df(rm(time_local)), ct(rm(request)), rf(rm(http_referer))
from source_log;