MySQL:Binlog rotate 引发的阻塞事件

旧文补录，2017年的一次故障处理，像破案一样层层解析出了最终真相。今年看了一本书叫《发现心流》，我肯定当时在分析这个问题时获得了心流体验。

背景

在大型MySQL集群增加了新的库，运行正常（没有开启GTID）。几天后将新加入的库开启GTID。
MySQL版本：5.7.19

故障现象

变更后，当业务压力开始上升，发现其中一个节点以大概半小时的频率，业务会间断性地发生阻塞。

分析过程

1，观察故障现象

根据现场描述，大概每半小时发生一次故障。在故障发生时，观测到以下现象：

a. show processlist 发现有 4000+ 线程正在执行 insert，执行时间在10～300s
b. show global status like 'threads_running' 无响应
c. top 观察CPU只使用到1个核，内存正常，iowait 值3%左右
d. iostat 观察到700M/s read，随后使用 iotop 观察有一个MySQL线程700M/s read

2，定位700M/s read线程

显然700M/s的读IO极不正常，问题突破点就在这里，接下来我们要找到数据库在做什么导致了这么高的读IO。 iotop 命令可查找到问题线程TID，通过 performance_schema.threads 表查找到对应的 MySQL 内部真实线程：
（以下截图均为测试环境复现截图，虚拟机IO差很多只有97M/s）

查得MySQL中线程正在做一个insert，且只插入一行数据。

第一反应是insert为什么会造成700M/s的读IO，并且会持续3-5分钟之久？分析insert语句，只插入1行数据，更加不可理解。几次故障观察到同一现象。

3，分析binlog

猜测：这是一个大事务，事务包含了大量 insert 语句。解析 binlog，使用 insert 语句中的唯一ID过滤 binlog，发现：

a. 这个 insert 所在事务只有1条 insert；
b. 这个 insert 所在事务位于 binlog 最末尾。

得到结论：insert 语句正常，并非大事务。那是什么原因？“insert 所在事务位于 binlog 最末尾”好像有点意思。

继续解析 binlog，发现几次故障时的 insert 所在事务也是位于 binlog 最末尾。这不是巧合，应该与 binlog rotate 有关！好像发现了什么！

4，确认故障与 binlog rotate 相关联

前面提到每次故障发生，MySQL 的 thread_running 会飙升至 4000 左右，并且无响应。通过这个信息，我们打开监控图表，对照每个 binlog 的刷新时间点，基本吻合。到这里，我们基本可以认为故障与 binlog rotate 是相关联的。接下来就是验证了。

5，复现与验证

情况紧急，当时在现场接下来的操作需要在从库（因为从库也发生过故障导致了主从切换）验证，从库执行 "flush logs" 手工刷新 binlog，并使用 strace 观测 mysql 到底做了什么导致了 700M/s 的读 IO。

a. 开启 strace，执行 strace -o /tmp/strace_output.txt -T -tt -f -e trace=read,open -p 'mysql_pid'
b. 从库执行 "flush logs"；
c. iotop 观察到有一个700M/s读的线程，问题复现！
d. 关闭 strace

6，原因定位

（以下截图均为测试环境复现截图，虚拟机IO差很多只有97M/s）

上一步复现了故障，并且获取 strace 信息，接下来进行分析，看到 strace 信息中大量重复的文件句柄号：

统计 strace 信息中文件句柄72 的总行数，占 strace 信息总行数的90%以上，由此可大致判断：对文件句柄72存在大量读操作：

查看MySQL进程下的文件句柄，发现文件句柄 72 对应文件为 binlog：

再分析 strace 信息中 open函数，发现 open 的文件都是 binlog，且在按顺序打开 binlog 文件！离真相越来越近了：）

这时我们脑海里会出现一个问题：为什么binlog刷新时会从头至尾扫描binlog？

扫描脑子里的知识库：binlog_gtid_simple_recovery，以下是MySQL官方手册解释：

This variable controls how binary log files are iterated during the search for GTIDs when MySQL starts or restarts. In MySQL version 5.7.5, this variable was added as simplified_binlog_gtid_recovery and in MySQL version 5.7.6 it was renamed to binlog_gtid_simple_recovery.

When binlog_gtid_simple_recovery=FALSE, the method of iterating the binary log files is:

To initialize gtid_executed, binary log files are iterated from the newest file, stopping at the first binary log that has anyPrevious_gtids_log_event. All GTIDs from Previous_gtids_log_event and Gtid_log_events are read from this binary log file. This GTID set is stored internally and called gtids_in_binlog. The value of gtid_executed is computed as the union of this set and the GTIDs stored in the mysql.gtid_executed table.

This process could take a long time if you had a large number of binary log files without GTID events, for example created whengtid_mode=OFF.
To initialize gtid_purged, binary log files are iterated from the oldest to the newest, stopping at the first binary log that contains either aPrevious_gtids_log_event that is non-empty (that has at least one GTID) or that has at least one Gtid_log_event. From this binary log it reads Previous_gtids_log_event. This GTID set is subtracted from gtids_in_binlog and the result stored in the internal variablegtids_in_binlog_not_purged. The value of gtid_purged is initialized to the value of gtid_executed, minusgtids_in_binlog_not_purged.

When binlog_gtid_simple_recovery=TRUE, which is the default in MySQL 5.7.7 and later, the server iterates only the oldest and the newest binary log files and the values of gtid_purged and gtid_executed are computed based only on Previous_gtids_log_event orGtid_log_event found in these files. This ensures only two binary log files are iterated during server restart or when binary logs are being purged.

结合源码归纳后可以理解为：
当MySQL重启，且binlog_gtid_simple_recovery=false：

更新gtid_executed，从最新的binlog开始扫描，直到找到第一个有Previous_gtids_log_event的binlog为止；
更新gtid_purged，从最旧的binlog开始扫描，直到找到第一个有Previous_gtids_log_event的binlog为止；

当MySQL重启，且binlog_gtid_simple_recovery=true：

更新gtid_executed，只扫描最新的那个 binlog，如果这个 binlog 不包含 gtid，直接将 GLOBAL.GTID_EXECUTED 设置为空；
更新gtid_purged，只扫描最旧的那个 binlog，如果这个 binlog 不包含 gtid，直接将 GLOBAL.GTID_PURGED 设置为空；

当删除 binlog，且 binlog_gtid_simple_recovery=false：

更新 gtid_purged，从最旧的 binlog 开始扫描，直到找到第一个有 Previous_gtids_log_event 的 binlog 为止；

当删除 binlog，且binlog_gtid_simple_recovery=true：

更新 gtid_purged，只扫描最旧的那个 binlog，如果这个 binlog 不包含 gtid，直接将 GLOBAL.GTID_PURGED 设置为空；

另外还有一种情况就是relay_log_recovery=0的情况，扫描relay_log以更新all_gtids (Retrieved_Gtid_Set)

故障MySQL设置是 binlog_gtid_simple_recovery=false，并且变更前是没有开启 GTID 的，意味着有一百多个 binlog 没有记录GTID，所以大量的读 IO 发生在扫描这一百多个 binlog，符合故障现象。

注意：在MySQL5.7.6以后，binlog_gtid_simple_recovery 默认值为 true，且上述解释与是否开启 GTID 无关。

再等等，好像有什么不对？我们的故障现场是 binlog rotate（包括手工flush logs），和手册上描述的 “MySQL重启”、“binlog被删除” 不是一回事啊！排除法：故障时MySQL并没有重启，那就是“binlog被删除”导致的了。

解到这里只剩最后一个问题了：binlog rotate（和手工flush logs）会导致binlog被删除吗？答案是：expire_logs_days

再看下官方手册：
The number of days for automatic binary log file removal. The default is 0, which means “no automatic removal.” Possible removals happen at startup and when the binary log is flushed. Log flushing occurs as indicated in Section 5.4, “MySQL Server Logs”.

翻译一下就是：当开启了 expire_logs_days，MySQL会自动删除 binlog，触发条件是“启动”和“binlog的刷新”。再次验证，在数据库中执行purge binary logs to 'mysql-bin.xxx'，观察 strace，iotop 能够稳定复现故障现象。

结论

受参数 expire_logs_days=5 影响，每次 binlog rotate 时会触发 binlog 的自动删除，当 binlog 被删除时：由于参数 binlog_gtid_simple_recovery=false，MySQL 会从头至尾扫描 binlog 直到找到第一个有 Previous_gtids_log_event 的 binlog 为止。但是MySQL一开始没有开启GTID，之前的一百多个 binlog 都没有记录 GTID，大致了大量的长时间的磁盘 IO，最终造成了数据库阻塞。

解决方案

删除没有记录 GTID 的 binlog，因为 purge binary logs to 'xxx' 命令会触发 binlog 扫描，所以这里只能使用 rm 物理文件的方式进行删除，删除后将 mysql-bin.index 文件中对应的行删除，注意不留空行；
修改配置文件中 binlog_gtid_simple_recovery=true（不能动态修改）。

1. rm -f mysql-bin.xxx
2. vim mysql-bin.index，删除对应记录
3. 登陆MySQL，执行flush logs</pre>

如何避免

创建数据库时直接启用 GTID，GTID 的好处谁用谁知道；
无论数据库是否开启 GTID，设置 binlog_gtid_simple_recovery=true；
当 MySQL 由非 GTID 模式升级为开启 GTID 时：
1. 设置参数binlog_gtid_simple_recovery=true；
2. 重启MySQL后，备份binlog，执行reset master；
3. 运行后不要切换GTID模式，以免造成GTID的紊乱。