案例 MovieNow
一家虚构的在线影视租赁公司 MovieNow,它拥有电影信息(类型和主演)以及(用户信息,用户评分)。它需要决定:投资于哪些新电影?
BI(商业智能目标)
- 运营决策的数据
- 根据演员的流行度,决定投资哪一部电影。
- 根据上月的收入,辅助决定,短期的投资预算
- 策略决策的数据
- 在哪国更受欢迎,用来决定市场扩张
- 长线的投资及收入
KPI(关键绩效指标)
- 租赁总收入
- 用户满意度:所有电影的平均评分
- 活跃用户总数:用户参与度
OLAP(联机分析处理)
联机分析处理(英语:Online analytical processing),简称OLAP (/ˈoʊlæp/),是计算机技术中快速解决多维分析问题(英语:multi-dimensional analytical)(MDA)的一种方法。[1] OLAP是更广泛的商业智能范畴的一部分,它还包括关系数据库、报告编写和数据挖掘。[2] OLAP的典型应用包括销售业务报告(英语:business reporting)、市场营销、管理报告、业务流程管理(BPM)[3]、预算和预测、财务报表以及类似领域,新的应用正在出现,如农业。[4] 术语“OLAP”是对传统数据库术语“联机事务处理”(OLTP)稍微修改而成的。[5]
下面主要是上卷(roll-up)和切片(slicing)的例子。
上卷:汇集所有的电影租赁数据、用户评分,查询该租赁网站用户偏好的趋势。
切片:根据不同维度,国家、性别对演员的评分进行筛选查看。
ROLLUP运算符
SELECT
c.country,
m.genre,
AVG(r.rating) AS avg_rating,
COUNT(*) AS num_rating
FROM renting AS r
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
LEFT JOIN customers AS c
ON r.customer_id = c.customer_id
GROUP BY ROLLUP(c.country, m.genre)
ORDER BY c.country, m.genre;
按照租赁用户所属国家、电影的类型来查看平均评分及评分数目。
相比与 GOURP BY , ROLLUP 包含了所有电影的平均评分,不同国家不限类型电影的平均评分,不同类型电影的平均评分。
CUBE运算符
SELECT
c.country,
m.genre,
AVG(r.rating) AS avg_rating
FROM renting AS r
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
LEFT JOIN customers AS c
ON r.customer_id = c.customer_id
GROUP BY CUBE(country, genre);
按照租赁用户所属国家、电影的类型来查看平均评分及评分数目。
ROLLUP 结果是 88条结果,CUBE 得到 96 条结果,多出来的结果,是因为 CUBE 生成所选列的所有合计组合。而 ROLLUP 是生成所选列的层级聚合结果。
ROLLUP (YEAR, MONTH, DAY)
With a ROLLUP, it will have the following outputs:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()
With CUBE, it will have the following:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()
GROUPING SET运算符
SELECT
nationality,
gender,
COUNT(*)
FROM actors
GROUP BY GROUPING SETS ((nationality), (gender), ());
GROUP SET 最灵活
可以返回不同的 GROUP BY 检索的集合
完整BI例子
问题: 投资哪些新电影?需要考虑,新电影比旧电影更贵。
- 用户对新旧电影的评分差异?
- 不同国家用户之间,对新旧电影的评分,有差异吗?
表联结(Join data)
联结三张表,分别是 renting/customers/movies
SELECT *
FROM renting AS r
LEFT JOIN customers AS c
ON c.customer_id = r.customer_id
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id;
选择相关记录(Select relevant records)
SELECT *
FROM renting AS r
LEFT JOIN customers AS c
ON c.customer_id = r.customer_id
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
WHERE r.movie_id IN (
SELECT movie_id
FROM renting
GROUP BY movie_id
HAVING COUNT(rating) >= 4)
AND r.date_renting >= '2018-04-01';
聚集数据(Aggregation)
SELECT c.country,
m.year_of_release,
COUNT(*) AS n_rentals,
COUNT(DISTINCT r.movie_id) AS n_movies,
AVG(rating) AS avg_rating
FROM renting AS r
LEFT JOIN customers AS c
ON c.customer_id = r.customer_id
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
WHERE r.movie_id IN (
SELECT movie_id
FROM renting
GROUP BY movie_id
HAVING COUNT(rating) >= 4)
AND r.date_renting >= '2018-04-01'
GROUP BY ROLLUP (m.year_of_release, c.country)
ORDER BY c.country, m.year_of_release;
按照不同的年份、国家来查看租赁次数、电影总数和平均评分
SELECT genre,
AVG(rating) AS avg_rating,
COUNT(rating) AS n_rating,
COUNT(*) AS n_rentals,
COUNT(DISTINCT m.movie_id) AS n_movies
FROM renting AS r
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
WHERE r.movie_id IN (
SELECT movie_id
FROM renting
GROUP BY movie_id
HAVING COUNT(rating) >= 3 )
AND r.date_renting >= '2018-01-01'
GROUP BY genre
ORDER BY avg_rating DESC;
筛选条件是 2018 年以后的租赁记录,评分超过 3 次的电影。
根据类型 genre 分组,按照平均评分降序排序
SELECT a.nationality,
a.gender,
AVG(r.rating) AS avg_rating,
COUNT(r.rating) AS n_rating,
COUNT(*) AS n_rentals,
COUNT(DISTINCT a.actor_id) AS n_actors
FROM renting AS r
LEFT JOIN actsin AS ai
ON ai.movie_id = r.movie_id
LEFT JOIN actors AS a
ON ai.actor_id = a.actor_id
WHERE r.movie_id IN (
SELECT movie_id
FROM renting
GROUP BY movie_id
HAVING COUNT(rating) >= 4)
AND r.date_renting >= '2018-04-01'
GROUP BY CUBE(nationality, gender);
哪些国家/性别的演员更受欢迎?
根据国籍、性别,查看演员参演的电影之中,所获得的平均评分、租赁字数、演员数目等等。
参考资料:
DataCamp<Data-Driven Decision Making in SQL>
SQL中ROLLUP、CUBE的用法和区别
understanding-the-differences-between-cube-and-rollup