背景
需要对用户的访问日志记录进行转化分析,由于日志记录没有业务session,需要物理地对访问记录进行session切分,规则是30分钟内的用户操作为一个会话(session),如果两个操作之间的时间间隔超过30分钟,则判定为两个session。
环境:GreenPlum(postgresql)
数据
/*
用户ID:bid
访问时间:event_time
页面key:event_key
*/
WITH log_data AS(
SELECT
bid,
event_time::TIMESTAMP AS event_time,
event_key,
/* 访问序号 */
ROW_NUMBER() OVER(PARTITION BY bid ORDER BY event_time) AS num
FROM
(
SELECT '123' AS bid, '2017-11-20 10:01:56' AS event_time, 'page1' AS event_key
UNION ALL
SELECT '123' AS bid, '2017-11-20 10:05:56' AS event_time, 'page2' AS event_key
UNION ALL
SELECT '123' AS bid, '2017-11-20 10:43:56' AS event_time, 'page1' AS event_key
UNION ALL
SELECT '123' AS bid, '2017-11-20 10:46:56' AS event_time, 'page2' AS event_key
UNION ALL
SELECT '123' AS bid, '2017-11-20 10:49:56' AS event_time, 'page3' AS event_key
)tmp
)
SELECT * FROM log_data;
切分session过程
需要找到一个记录A的时间减去上一个记录B的时间差值大于30分钟,如下图中,需要找到箭头指向的那条10:43:56访问的记录
问题转化为寻找每个session的起始访问记录,详细sql如下:
SELECT
t1.*,
t2.num,
/* 如果记录为首次或两个记录时间差大于30分钟则打上标记 */
CASE WHEN t2. event_time is NULL OR t1.event_time - t2.event_time > INTERVAL '30 minute' THEN t1.num ELSE null END AS time_diff
FROM
log_data t1
LEFT JOIN
/* 前后记录关联 */
log_data t2 on t1.bid = t2.bid AND t1.num = t2.num + 1
根据t1.bid = t2.bid AND t1.num = t2.num + 1关联得到切分点的num,如下图中的time_diff
通过time_diff is not null 得到具体的切分点
SELECT
bid,
time_diff,
event_time
FROM
(
SELECT
t1.*,
t2.num,
/* 如果记录为首次或两个记录时间差大于30分钟则打上标记 */
CASE WHEN t2. event_time is NULL OR t1.event_time - t2.event_time > INTERVAL '30 minute' THEN t1.num ELSE null END AS time_diff
FROM
log_data t1
LEFT JOIN
/* 前后记录关联 */
log_data t2 on t1.bid = t2.bid AND t1.num = t2.num + 1
)t02
WHERE
t02.time_diff IS NOT NULL
图一跟图四通过t01.bid = t02.bid AND t01.num >= t02.time_diff关联排序过滤最终完成session切分
SELECT
*
FROM
(
SELECT
t01.*,
t02.event_time,
time_diff,
/* 取分等值连接后的第一条数据 */
ROW_NUMBER() OVER(PARTITION BY t01.bid,t01.num ORDER BY t02.time_diff desc) AS num_2
FROM
log_data t01
LEFT JOIN
(
SELECT
bid,
time_diff,
event_time
FROM
(
SELECT
t1.*,
t2.num,
/* 如果记录为首次或两个记录时间差大于30分钟则打上标记 */
CASE WHEN t2. event_time is NULL OR t1.event_time - t2.event_time > INTERVAL '30 minute' THEN t1.num ELSE null END AS time_diff
FROM
log_data t1
LEFT JOIN
/* 前后记录关联 */
log_data t2 on t1.bid = t2.bid AND t1.num = t2.num + 1
)t02
WHERE
t02.time_diff IS NOT NULL
)t02 ON t01.bid = t02.bid AND t01.num >= t02.time_diff
)tmp
/* 取num_2为1的记录 */
WHERE num_2 = 1
再添加session_id字段即可,本文采用md5(bid+event_time1)作为session_id