[TOC]
一、Flink SQL 背景
Flink SQL 是 Flink 实时计算为简化计算模型,降低用户使用实时计算门槛而设计的一套符合标准 SQL 语义的开发语言。
自 2015 年开始,阿里巴巴开始调研开源流计算引擎,最终决定基于 Flink 打造新一代计算引擎,针对 Flink 存在的不足进行优化和改进,并且在 2019 年初将最终代码开源,也就是我们熟知的 Blink。Blink 在原来的 Flink 基础上最显著的一个贡献就是 Flink SQL 的实现。
Flink SQL 是面向用户的 API 层,在我们传统的流式计算领域,比如 Storm、Spark Streaming 都会提供一些 Function 或者 Datastream API,用户通过 Java 或 Scala 写业务逻辑,这种方式虽然灵活,但有一些不足,比如具备一定门槛且调优较难,随着版本的不断更新,API 也出现了很多不兼容的地方。
在这个背景下,毫无疑问,SQL 就成了我们最佳选择,之所以选择将 SQL 作为核心 API,是因为其具有几个非常重要的特点:
- SQL 属于设定式语言,用户只要表达清楚需求即可,不需要了解具体做法;
- SQL 可优化,内置多种查询优化器,这些查询优化器可为 SQL 翻译出最优执行计划;
- SQL 易于理解,不同行业和领域的人都懂,学习成本较低;
- SQL 非常稳定,在数据库 30 多年的历史中,SQL 本身变化较少;
- 流与批的统一,Flink 底层 Runtime 本身就是一个流与批统一的引擎,而 SQL 可以做到 API 层的流与批统一。
三、Flink SQL 的编程模型
Flink 的编程模型基础构建模块是流(streams)与转换 (transformations),每一个数据流起始于一个或多个 source,并终止于一个或多个 sink。
相信大家对上面的图已经十分熟悉了,当然基于 Flink SQL 编写的 Flink 程序也离不开读取原始数据,计算逻辑和写入计算结果数据三部分。
一个完整的 Flink SQL 编写的程序包括如下三部分:
Source Operator:Soruce operator 是对外部数据源的抽象, 目前 Apache Flink 内置了很多常用的数据源实现例如 MySQL、Kafka 等;
Transformation Operators:算子操作主要完成例如查询、聚合操作等,目前 Flink SQL 支持了 Union、Join、Projection、Difference、Intersection 及 window 等大多数传统数据库支持的操作;
Sink Operator:Sink operator 是对外结果表的抽象,目前 Apache Flink 也内置了很多常用的结果表的抽象,比如 Kafka Sink 等
我们通过用一个最经典的 WordCount 程序作为入门,看一下传统的基于 DataSet/DataStream API 开发和基于 SQL 开发有哪些不同?
Flink SQL
//省略掉初始化环境等公共代码
SELECT word, COUNT(word) FROM table GROUP BY word;
四、Flink SQL 的语法和算子
4.1 Flink SQL 支持的语法
Flink SQL 核心算子的语义设计参考了 1992、2011 等 ANSI-SQL 标准,Flink 使用 Apache Calcite 解析 SQL ,Calcite 支持标准的 ANSI SQL。
那么 Flink 自身支持的 SQL 语法有哪些呢?
insert:
INSERT INTO tableReference
query
query:
values
| {
select
| selectWithoutFrom
| query UNION [ ALL ] query
| query EXCEPT query
| query INTERSECT query
}
[ ORDER BY orderItem [, orderItem ]* ]
[ LIMIT { count | ALL } ]
[ OFFSET start { ROW | ROWS } ]
[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY]
orderItem:
expression [ ASC | DESC ]
select:
SELECT [ ALL | DISTINCT ]
{ * | projectItem [, projectItem ]* }
FROM tableExpression
[ WHERE booleanExpression ]
[ GROUP BY { groupItem [, groupItem ]* } ]
[ HAVING booleanExpression ]
[ WINDOW windowName AS windowSpec [, windowName AS windowSpec ]* ]
selectWithoutFrom:
SELECT [ ALL | DISTINCT ]
{ * | projectItem [, projectItem ]* }
projectItem:
expression [ [ AS ] columnAlias ]
| tableAlias . *
tableExpression:
tableReference [, tableReference ]*
| tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ] JOIN tableExpression [ joinCondition ]
joinCondition:
ON booleanExpression
| USING '(' column [, column ]* ')'
tableReference:
tablePrimary
[ [ AS ] alias [ '(' columnAlias [, columnAlias ]* ')' ] ]
tablePrimary:
[ TABLE ] [ [ catalogName . ] schemaName . ] tableName
| LATERAL TABLE '(' functionName '(' expression [, expression ]* ')' ')'
| UNNEST '(' expression ')'
values:
VALUES expression [, expression ]*
groupItem:
expression
| '(' ')'
| '(' expression [, expression ]* ')'
| CUBE '(' expression [, expression ]* ')'
| ROLLUP '(' expression [, expression ]* ')'
| GROUPING SETS '(' groupItem [, groupItem ]* ')'
windowRef:
windowName
| windowSpec
windowSpec:
[ windowName ]
'('
[ ORDER BY orderItem [, orderItem ]* ]
[ PARTITION BY expression [, expression ]* ]
[
RANGE numericOrIntervalExpression {PRECEDING}
| ROWS numericExpression {PRECEDING}
]
')'
上面 SQL 的语法支持也已经表明了 Flink SQL 对算子的支持,接下来我们对 Flink SQL 中最常见的算子语义进行介绍。
4.2 Flink SQL 常用算子
SELECT
SELECT 用于从 DataSet/DataStream 中选择数据,用于筛选出某些列。
示例:
SELECT * FROM Table;// 取出表中的所有列
SELECT name,age FROM Table;// 取出表中 name 和 age 两列
与此同时 SELECT 语句中可以使用函数和别名,例如我们上面提到的 WordCount 中:
SELECT word, COUNT(word) FROM table GROUP BY word;
WHERE
WHERE 用于从数据集/流中过滤数据,与 SELECT 一起使用,用于根据某些条件对关系做水平分割,即选择符合条件的记录。
SELECT name,age FROM Table where name LIKE ‘% 小明 %’;
SELECT * FROM Table WHERE age = 20;
WHERE 是从原数据中进行过滤,那么在 WHERE 条件中,Flink SQL 同样支持 =、<、>、<>、>=、<=,以及 AND、OR 等表达式的组合,最终满足过滤条件的数据会被选择出来。并且 WHERE 可以结合 IN、NOT IN 联合使用。举个负责的例子:
SELECT name, age
FROM Table
WHERE name IN (SELECT name FROM Table2)
DISTINCT
DISTINCT 用于从数据集/流中去重根据 SELECT 的结果进行去重。
SELECT DISTINCT name FROM Table;
六、Flink SQL 实战应用
上面我们分别介绍了 Flink SQL 的背景、新特性、编程模型和常用算子,这部分我们将模拟一个真实的案例为大家使用 Flink SQL 提供一个完整的 Demo。
相信这里应该有很多 NBA 的球迷,假设我们有一份数据记录了每个赛季的得分王的数据,包括赛季、球员、出场、首发、时间、助攻、抢断、盖帽、得分等。现在我们要统计获得得分王荣誉最多的三名球员。
原数据存在 score.csv 文件中,如下:
17-18,詹姆斯-哈登,72,72,35.4,8.8,1.8,0.7,30.4
16-17,拉塞尔-威斯布鲁克,81,81,34.6,10.4,1.6,0.4,31.6
15-16,斯蒂芬-库里,79,79,34.2,6.7,2.1,0.2,30.1
14-15,拉塞尔-威斯布鲁克,67,67,34.4,8.6,2.1,0.2,28.1
13-14,凯文-杜兰特,81,81,38.5,5.5,1.3,0.7,32
12-13,卡梅罗-安东尼,67,67,37,2.6,0.8,0.5,28.7
11-12,凯文-杜兰特,66,66,38.6,3.5,1.3,1.2,28
10-11,凯文-杜兰特,78,78,38.9,2.7,1.1,1,27.7
09-10,凯文-杜兰特,82,82,39.5,2.8,1.4,1,30.1
08-09,德维恩-韦德,79,79,38.6,7.5,2.2,1.3,30.2
07-08,勒布朗-詹姆斯,75,74,40.4,7.2,1.8,1.1,30
06-07,科比-布莱恩特,77,77,40.8,5.4,1.4,0.5,31.6
05-06,科比-布莱恩特,80,80,41,4.5,1.8,0.4,35.4
04-05,阿伦-艾弗森,75,75,42.3,7.9,2.4,0.1,30.7
03-04,特雷西·麦克格雷迪,67,67,39.9,5.5,1.4,0.6,28
02-03,特雷西·麦克格雷迪,75,74,39.4,5.5,1.7,0.8,32.1
01-02,阿伦-艾弗森,60,59,43.7,5.5,2.8,0.2,31.4
00-01,阿伦-艾弗森,71,71,42,4.6,2.5,0.3,31.1
99-00,沙奎尔-奥尼尔,79,79,40,3.8,0.5,3,29.7
98-99,阿伦-艾弗森,48,48,41.5,4.6,2.3,0.1,26.8
97-98,迈克尔-乔丹,82,82,38.8,3.5,1.7,0.5,28.7
96-97,迈克尔-乔丹,82,82,37.9,4.3,1.7,0.5,29.6
95-96,迈克尔-乔丹,82,82,37.7,4.3,2.2,0.5,30.4
94-95,沙奎尔-奥尼尔,79,79,37,2.7,0.9,2.4,29.3
93-94,大卫-罗宾逊,80,80,40.5,4.8,1.7,3.3,29.8
92-93,迈克尔-乔丹,78,78,39.3,5.5,2.8,0.8,32.6
91-92,迈克尔-乔丹,80,80,38.8,6.1,2.3,0.9,30.1
90-91,迈克尔-乔丹,82,82,37,5.5,2.7,1,31.5
89-90,迈克尔-乔丹,82,82,39,6.3,2.8,0.7,33.6
88-89,迈克尔-乔丹,81,81,40.2,8,2.9,0.8,32.5
87-88,迈克尔-乔丹,82,82,40.4,5.9,3.2,1.6,35
86-87,迈克尔-乔丹,82,82,40,4.6,2.9,1.5,37.1
85-86,多米尼克-威尔金斯,78,78,39.1,2.6,1.8,0.6,30.3
84-85,伯纳德-金,55,55,37.5,3.7,1.3,0.3,32.9
83-84,阿德里安-丹特利,79,79,37.8,3.9,0.8,0.1,30.6
82-83,阿历克斯-英格利什,82,82,36.4,4.8,1.4,1.5,28.4
81-82,乔治-格文,79,79,35.7,2.4,1,0.6,32.3
首先我们需要创建一个工程,并且在 Maven 中有如下依赖:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<flink.version>1.7.1</flink.version>
<slf4j.version>1.7.7</slf4j.version>
<log4j.version>1.2.17</log4j.version>
<scala.binary.version>2.11</scala.binary.version>
</properties>
<dependencies>
<!-- Apache Flink dependencies -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.11</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
<version>1.7.1</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${log4j.version}</version>
</dependency>
第一步,创建上下文环境:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
第二步,读取 score.csv 并且作为 source 输入:
DataSet<String> input = env.readTextFile("score.csv");
DataSet<PlayerData> topInput = input.map(new MapFunction<String, PlayerData>() {
@Override
public PlayerData map(String s) throws Exception {
String[] split = s.split(",");
return new PlayerData(String.valueOf(split[0]),
String.valueOf(split[1]),
String.valueOf(split[2]),
Integer.valueOf(split[3]),
Double.valueOf(split[4]),
Double.valueOf(split[5]),
Double.valueOf(split[6]),
Double.valueOf(split[7]),
Double.valueOf(split[8])
);
}
});
其中的PlayerData类为自定义类:
public static class PlayerData {
/**
* 赛季,球员,出场,首发,时间,助攻,抢断,盖帽,得分
*/
public String season;
public String player;
public String play_num;
public Integer first_court;
public Double time;
public Double assists;
public Double steals;
public Double blocks;
public Double scores;
public PlayerData() {
super();
}
public PlayerData(String season,
String player,
String play_num,
Integer first_court,
Double time,
Double assists,
Double steals,
Double blocks,
Double scores
) {
this.season = season;
this.player = player;
this.play_num = play_num;
this.first_court = first_court;
this.time = time;
this.assists = assists;
this.steals = steals;
this.blocks = blocks;
this.scores = scores;
}
}
第三步,将 source 数据注册成表:
Table topScore = tableEnv.fromDataSet(topInput);
tableEnv.registerTable("score", topScore);
第四步,核心处理逻辑 SQL 的编写:
Table queryResult = tableEnv.sqlQuery("
select player,
count(season) as num
FROM score
GROUP BY player
ORDER BY num desc
LIMIT 3
");
第五步,输出结果:
DataSet<Result> result = tableEnv.toDataSet(queryResult, Result.class);
result.print();
我们直接运行整个程序,观察输出结果:
迈克尔-乔丹:10
凯文-杜兰特:4
阿伦-艾弗森:4
完整的代码如下:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
public class TableSQL {
public static void main(String[] args) throws Exception{
//1\. 获取上下文环境 table的环境
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
//2\. 读取score.csv
DataSet<String> input = env.readTextFile("score.csv");
input.print();
DataSet<PlayerData> topInput = input.map(new MapFunction<String, PlayerData>() {
@Override
public PlayerData map(String s) throws Exception {
String[] split = s.split(",");
return new PlayerData(String.valueOf(split[0]),
String.valueOf(split[1]),
String.valueOf(split[2]),
Integer.valueOf(split[3]),
Double.valueOf(split[4]),
Double.valueOf(split[5]),
Double.valueOf(split[6]),
Double.valueOf(split[7]),
Double.valueOf(split[8])
);
}
});
//3\. 注册成内存表
Table topScore = tableEnv.fromDataSet(topInput);
tableEnv.registerTable("score", topScore);
//4\. 编写sql 然后提交执行
//select player, count(season) as num from score group by player order by num desc;
Table queryResult = tableEnv.sqlQuery("select player, count(season) as num from score group by player order by num desc limit 3");
//5\. 结果进行打印
DataSet<Result> result = tableEnv.toDataSet(queryResult, Result.class);
result.print();
}
public static class PlayerData {
/**
* 赛季,球员,出场,首发,时间,助攻,抢断,盖帽,得分
*/
public String season;
public String player;
public String play_num;
public Integer first_court;
public Double time;
public Double assists;
public Double steals;
public Double blocks;
public Double scores;
public PlayerData() {
super();
}
public PlayerData(String season,
String player,
String play_num,
Integer first_court,
Double time,
Double assists,
Double steals,
Double blocks,
Double scores
) {
this.season = season;
this.player = player;
this.play_num = play_num;
this.first_court = first_court;
this.time = time;
this.assists = assists;
this.steals = steals;
this.blocks = blocks;
this.scores = scores;
}
}
public static class Result {
public String player;
public Long num;
public Result() {
super();
}
public Result(String player, Long num) {
this.player = player;
this.num = num;
}
@Override
public String toString() {
return player + ":" + num;
}
}
}//
当然我们也可以自定义一个 Sink,将结果输出到一个文件中,例如:
TableSink sink = new CsvTableSink("/home/result.csv", ",");
String[] fieldNames = {"name", "num"};
TypeInformation[] fieldTypes = {Types.STRING, Types.INT};
tableEnv.registerTableSink("result", fieldNames, fieldTypes, sink);
sqlQuery.insertInto("result");
env.execute();
然后我们运行程序,可以看到 /home 目录下生成的 result.csv,查看结果:
迈克尔-乔丹,10
凯文-杜兰特,4
阿伦-艾弗森,4
小结
本篇向大家介绍了 Flink SQL 产生的背景,Flink SQL 大部分核心功能,并且分别介绍了 Flink SQL 的编程模型和常用算子及内置函数。最后以一个完整的示例展示了如何编写 Flink SQL 程序。Flink SQL 的简便易用极大地降低了 Flink 编程的门槛,是我们必需掌握的使用 Flink 解决流式计算问题最锋利的武器!