一、问题
这个问题大概是这样。当前主从状态如下有2个半同步从库,如果停掉1个半同步从库是是否会堵塞主库,堵塞时间大概是多久?
mysql> show global status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name | Value |
+--------------------------------------------+-----------+
| Rpl_semi_sync_master_clients | 2 |
| Rpl_semi_sync_master_net_avg_wait_time | 0 |
| Rpl_semi_sync_master_net_wait_time | 0 |
| Rpl_semi_sync_master_net_waits | 8454 |
| Rpl_semi_sync_master_no_times | 9 |
| Rpl_semi_sync_master_no_tx | 1597 |
| Rpl_semi_sync_master_status | ON |
| Rpl_semi_sync_master_timefunc_failures | 0 |
| Rpl_semi_sync_master_tx_avg_wait_time | 77630 |
| Rpl_semi_sync_master_tx_wait_time | 169312495 |
| Rpl_semi_sync_master_tx_waits | 2181 |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0 |
| Rpl_semi_sync_master_wait_sessions | 0 |
| Rpl_semi_sync_master_yes_tx | 2181 |
| Rpl_semi_sync_slave_status | OFF |
+--------------------------------------------+-----------+
15 rows in set (0.01 sec)
mysql> show variables like '%semi%';
+-------------------------------------------+---------------------+
| Variable_name | Value |
+-------------------------------------------+---------------------+
| rpl_semi_sync_master_enabled | ON |
| rpl_semi_sync_master_timeout | 1000000000000000000 |
| rpl_semi_sync_master_trace_level | 32 |
| rpl_semi_sync_master_wait_for_slave_count | 2 |
| rpl_semi_sync_master_wait_no_slave | OFF |
| rpl_semi_sync_master_wait_point | AFTER_SYNC |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
+-------------------------------------------+---------------------+
二、分析
首先从设置中这里,
- rpl_semi_sync_master_timeout:1000000000000000000 设置为几乎无限大
- rpl_semi_sync_master_wait_for_slave_count:2 设置为2代表需要2个半同步落盘才返回ACK。
- rpl_semi_sync_master_wait_no_slave:OFF 设置为了OFF(默认为ON)
而从当前的状态来看,
- Rpl_semi_sync_master_clients 存在2个半同步从库
- Rpl_semi_sync_master_status 为ON,处于半同步状态。
这种情况下虽然rpl_semi_sync_master_timeout设置了很大值,但是如果2个半同步中一个关闭后,是否需要等待rpl_semi_sync_master_timeout设置的值呢?其实这就和rpl_semi_sync_master_wait_no_slave 参数有关,因为设置为off,那么当半同步的dump线程退出的时候那么会判断当前半同步的数量是否满足rpl_semi_sync_master_wait_for_slave_count 参数设置,如果不满足则马上降级为异步复制(回调函数repl_semi_binlog_dump_end) 。
但是dump线程的退出并不是及时的,需要等到一次心跳超时再次发送心跳的时候发现无法发送才会停止dump线程,而线上繁忙的系统,通常DUMP线程不会处于等待心跳发送的情况下的,但是一但回话线程在等待ACK上堵塞,那么binlog就不会再更改,dump线程也就没有事可做,只能触发心跳发送,但是发送心跳前会等待一个心跳周期为slave_net_timeout/2,如下,
Binlog_sender::wait_with_heartbeat
do
{
set_timespec_nsec(&ts, m_heartbeat_period); //心跳周期
ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts);//如果没有binlog生成,也会不会被提前唤醒,等待到超时
if (ret != ETIMEDOUT && ret != ETIME) //如果是正常收到 发送信号则返回 否则如果是超时则发送心跳event,进行心跳检测
break; //正常返回0 是超时返回ETIMEDOUT 继续循环
...
if (send_heartbeat_event(log_pos)) //发送心跳
return 1;
} while (!m_thd->killed);
slave_net_timeout默认为60秒,因此心跳周期默认为是30秒的时间,也就是参数slave_net_timeout/2,也就是这种情况下,半同步降级可能来自2个方面,
- 通过repl_semi_report_binlog_sync 回调函数,用户线程ACK超时直接降级,这依赖rpl_semi_sync_master_timeout参数
- 心跳发送超时未接收到,导致dump线程退出,退出触发回调函数repl_semi_binlog_dump_end,判断当前剩余半桶数量是否小于rpl_semi_sync_master_wait_for_slave_count 参数设置降级。
因此此处降级的时间大概应该是min(slave_net_timeout/2,rpl_semi_sync_master_timeout(微秒)),取一个小值。
当然如果rpl_semi_sync_master_wait_no_slave保持默认值ON,则必须等待rpl_semi_sync_master_timeout参数设置的大小才降级。
三、结论和测试截图
因此这里我们看到这个配置下(rpl_semi_sync_master_wait_no_slave=OFF,rpl_semi_sync_master_wait_for_slave_count =2),停掉一个从库实际上会影响到主库但是在slave_net_timeout/2超时后会降级,默认是30秒,也就是30秒降级,当然如果rpl_semi_sync_master_timeout是默认的10秒,那么优先使用rpl_semi_sync_master_timeout的时间作为降级标准。这种情况下,5.7 测试截图如下,
8.0测试如下,
四、超时栈
(gdb) bt
#0 0x00007ffff7bcade2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000018a9cb7 in native_cond_timedwait (cond=0x2dc1100 <mysql_bin_log+1824>, mutex=0x2dc1048 <mysql_bin_log+1640>, abstime=0x7fffec165e10)
at /opt/percona-server-locks-detail-5.7.22/include/thr_cond.h:129
#2 0x00000000018aa01c in safe_cond_timedwait (cond=0x2dc1100 <mysql_bin_log+1824>, mp=0x2dc1020 <mysql_bin_log+1600>, abstime=0x7fffec165e10,
file=0x2221bd8 "/opt/percona-server-locks-detail-5.7.22/sql/binlog.cc", line=8362) at /opt/percona-server-locks-detail-5.7.22/mysys/thr_cond.c:88
#3 0x0000000001802a65 in my_cond_timedwait (cond=0x2dc1100 <mysql_bin_log+1824>, mp=0x2dc1020 <mysql_bin_log+1600>, abstime=0x7fffec165e10,
file=0x2221bd8 "/opt/percona-server-locks-detail-5.7.22/sql/binlog.cc", line=8362) at /opt/percona-server-locks-detail-5.7.22/include/thr_cond.h:180
#4 0x0000000001802f61 in inline_mysql_cond_timedwait (that=0x2dc1100 <mysql_bin_log+1824>, mutex=0x2dc1020 <mysql_bin_log+1600>, abstime=0x7fffec165e10,
src_file=0x2221bd8 "/opt/percona-server-locks-detail-5.7.22/sql/binlog.cc", src_line=8362) at /opt/percona-server-locks-detail-5.7.22/include/mysql/psi/mysql_thread.h:1229
#5 0x0000000001814919 in MYSQL_BIN_LOG::wait_for_update_bin_log (this=0x2dc09e0 <mysql_bin_log>, thd=0x7ffee40008c0, timeout=0x7fffec165e10)
at /opt/percona-server-locks-detail-5.7.22/sql/binlog.cc:8362
#6 0x0000000001839283 in Binlog_sender::wait_with_heartbeat (this=0x7fffec1665b0, log_pos=194) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:652
#7 0x00000000018391d2 in Binlog_sender::wait_new_events (this=0x7fffec1665b0, log_pos=194) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:631
#8 0x0000000001838e87 in Binlog_sender::get_binlog_end_pos (this=0x7fffec1665b0, log_cache=0x7fffec165fc0) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:397
#9 0x00000000018369c5 in Binlog_sender::send_binlog (this=0x7fffec1665b0, log_cache=0x7fffec165fc0, start_pos=123) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:345
#10 0x0000000001836383 in Binlog_sender::run (this=0x7fffec1665b0) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:226
#11 0x0000000001834074 in mysql_binlog_send (thd=0x7ffee40008c0, log_ident=0x7fffec1671e0 "", pos=4, slave_gtid_executed=0x7fffec166fb0, flags=0)
at /opt/percona-server-locks-detail-5.7.22/sql/rpl_master.cc:414
#12 0x0000000001833f40 in com_binlog_dump_gtid (thd=0x7ffee40008c0, packet=0x7ffee40063f1 "", packet_length=124) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_master.cc:398
#13 0x00000000015671d6 in dispatch_command (thd=0x7ffee40008c0, com_data=0x7fffec167d90, command=COM_BINLOG_DUMP_GTID) at /opt/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:1705
#14 0x00000000015655c5 in do_command (thd=0x7ffee40008c0) at /opt/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:1021
#15 0x00000000016a635c in handle_connection (arg=0x64b1d50) at /opt/percona-server-locks-detail-5.7.22/sql/conn_handler/connection_handler_per_thread.cc:312
#16 0x00000000018ce0f6 in pfs_spawn_thread (arg=0x5d3bc20) at /opt/percona-server-locks-detail-5.7.22/storage/perfschema/pfs.cc:2190
#17 0x00007ffff7bc6ea5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff66008dd in clone () from /lib64/libc.so.6
(gdb) c
Continuing.
2024-10-21T08:24:34.739778Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8700ms. The settings might not be optimal. (flushed=0, during the time.)
Breakpoint 3, repl_semi_binlog_dump_end (param=0x7fffec165ee0) at /opt/percona-server-locks-detail-5.7.22/plugin/semisync/semisync_master_plugin.cc:164
164 bool semi_sync_slave= is_semi_sync_dump();
(gdb) bt
#0 repl_semi_binlog_dump_end (param=0x7fffec165ee0) at /opt/percona-server-locks-detail-5.7.22/plugin/semisync/semisync_master_plugin.cc:164
#1 0x000000000148c6f6 in Binlog_transmit_delegate::transmit_stop (this=0x2d89fc0 <delegates_init()::transmit_mem>, thd=0x7ffee40008c0, flags=0)
at /opt/percona-server-locks-detail-5.7.22/sql/rpl_handler.cc:729
#2 0x0000000001836169 in Binlog_sender::cleanup (this=0x7fffec1665b0) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:176
#3 0x00000000018368b9 in Binlog_sender::run (this=0x7fffec1665b0) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_binlog_sender.cc:312
#4 0x0000000001834074 in mysql_binlog_send (thd=0x7ffee40008c0, log_ident=0x7fffec1671e0 "", pos=4, slave_gtid_executed=0x7fffec166fb0, flags=0)
at /opt/percona-server-locks-detail-5.7.22/sql/rpl_master.cc:414
#5 0x0000000001833f40 in com_binlog_dump_gtid (thd=0x7ffee40008c0, packet=0x7ffee40063f1 "", packet_length=124) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_master.cc:398
#6 0x00000000015671d6 in dispatch_command (thd=0x7ffee40008c0, com_data=0x7fffec167d90, command=COM_BINLOG_DUMP_GTID) at /opt/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:1705
#7 0x00000000015655c5 in do_command (thd=0x7ffee40008c0) at /opt/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:1021
#8 0x00000000016a635c in handle_connection (arg=0x64b1d50) at /opt/percona-server-locks-detail-5.7.22/sql/conn_handler/connection_handler_per_thread.cc:312
#9 0x00000000018ce0f6 in pfs_spawn_thread (arg=0x5d3bc20) at /opt/percona-server-locks-detail-5.7.22/storage/perfschema/pfs.cc:2190
#10 0x00007ffff7bc6ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff66008dd in clone () from /lib64/libc.so.6
(gdb) thread 73
[Switching to thread 73 (Thread 0x7fffd8161700 (LWP 1887))]
#0 0x00007ffff7bcade2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007ffff7bcade2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000004802623 in native_cond_timedwait (cond=0x7e0efa8 <mysql_bin_log+1096>, mutex=0xa2d21d8, abstime=0x7fffd815e280) at /newdata/mysql-8.0.23/include/thr_cond.h:100
#2 0x000000000480298e in safe_cond_timedwait (cond=0x7e0efa8 <mysql_bin_log+1096>, mp=0xa2d21b0, abstime=0x7fffd815e280, file=0x61b1de8 "/newdata/mysql-8.0.23/sql/binlog.cc", line=7557)
at /newdata/mysql-8.0.23/mysys/thr_cond.cc:113
#3 0x00000000043b8259 in my_cond_timedwait (cond=0x7e0efa8 <mysql_bin_log+1096>, mp=0x7e0ef48 <mysql_bin_log+1000>, abstime=0x7fffd815e280, file=0x61b1de8 "/newdata/mysql-8.0.23/sql/binlog.cc",
line=7557) at /newdata/mysql-8.0.23/include/thr_cond.h:147
#4 0x00000000043b8496 in inline_mysql_cond_timedwait (that=0x7e0efa8 <mysql_bin_log+1096>, mutex=0x7e0ef48 <mysql_bin_log+1000>, abstime=0x7fffd815e280,
src_file=0x61b1de8 "/newdata/mysql-8.0.23/sql/binlog.cc", src_line=7557) at /newdata/mysql-8.0.23/include/mysql/psi/mysql_cond.h:241
#5 0x00000000043cb6a6 in MYSQL_BIN_LOG::wait_for_update (this=0x7e0eb60 <mysql_bin_log>, timeout=0x7fffd815e280) at /newdata/mysql-8.0.23/sql/binlog.cc:7557
#6 0x000000000448acaf in Binlog_sender::wait_with_heartbeat (this=0x7fffd815eb60, log_pos=316133) at /newdata/mysql-8.0.23/sql/rpl_binlog_sender.cc:796
#7 0x0000000004489023 in Binlog_sender::wait_new_events (this=0x7fffd815eb60, log_pos=316133) at /newdata/mysql-8.0.23/sql/rpl_binlog_sender.cc:778
#8 0x0000000004488502 in Binlog_sender::get_binlog_end_pos (this=0x7fffd815eb60, reader=0x7fffd815e970, end_pos=0x7fffd815e440) at /newdata/mysql-8.0.23/sql/rpl_binlog_sender.cc:557
#9 0x0000000004488201 in Binlog_sender::send_binlog (this=0x7fffd815eb60, reader=0x7fffd815e970, start_pos=125) at /newdata/mysql-8.0.23/sql/rpl_binlog_sender.cc:511
#10 0x0000000004487b95 in Binlog_sender::run (this=0x7fffd815eb60) at /newdata/mysql-8.0.23/sql/rpl_binlog_sender.cc:412
#11 0x0000000004483604 in mysql_binlog_send (thd=0x7fff20015540, log_ident=0x7fffd815f700 "", pos=4, slave_gtid_executed=0x7fffd815f540, flags=2) at /newdata/mysql-8.0.23/sql/rpl_master.cc:1017
#12 0x00000000044834c9 in com_binlog_dump_gtid (thd=0x7fff20015540, packet=0x7fff20006f71 "", packet_length=70) at /newdata/mysql-8.0.23/sql/rpl_master.cc:1002
#13 0x000000000323dcc9 in dispatch_command (thd=0x7fff20015540, com_data=0x7fffd8160b40, command=COM_BINLOG_DUMP_GTID) at /newdata/mysql-8.0.23/sql/sql_parse.cc:2048
#14 0x000000000323b513 in do_command (thd=0x7fff20015540) at /newdata/mysql-8.0.23/sql/sql_parse.cc:1320
#15 0x000000000340dfb1 in handle_connection (arg=0xa2d4880) at /newdata/mysql-8.0.23/sql/conn_handler/connection_handler_per_thread.cc:301
#16 0x000000000500dc16 in pfs_spawn_thread (arg=0xa524270) at /newdata/mysql-8.0.23/storage/perfschema/pfs.cc:2900
#17 0x00007ffff7bc6ea5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff61418dd in clone () from /lib64/libc.so.6
(gdb) f 5
#5 0x00000000043cb6a6 in MYSQL_BIN_LOG::wait_for_update (this=0x7e0eb60 <mysql_bin_log>, timeout=0x7fffd815e280) at /newdata/mysql-8.0.23/sql/binlog.cc:7557
7557 ret = mysql_cond_timedwait(&update_cond, &LOCK_binlog_end_pos,
(gdb) p timeout
$3 = (const timespec *) 0x7fffd815e280
(gdb) p *timeout
$4 = {tv_sec = 1729517575, tv_nsec = 413257724}
(gdb) f 6
#6 0x000000000448acaf in Binlog_sender::wait_with_heartbeat (this=0x7fffd815eb60, log_pos=316133) at /newdata/mysql-8.0.23/sql/rpl_binlog_sender.cc:796
796 ret = mysql_bin_log.wait_for_update(&ts);
(gdb) p ts
$5 = {tv_sec = 1729517575, tv_nsec = 413257724}
(gdb) p m_heartbeat_period.count()
$6 = 30000001024