在Oracle数据库集群中,有3个资源对于维持数据库服务有着至关重要的作用,分别是网络资源(监听资源)、ASM磁盘组资源、数据库服务资源。往往其中任何一个资源故障后,都会影响对应节点的实例上数据库对外提供服务。
昨天早上到公司不久,发现日常使用的一套数据库上的所有节点上的系统负载降到5.0以下了(前一段时间使用swingbench压力测试工具对数据库进行压测),压测期间系统负载基本维持在100左右,差不多消耗掉了CPU所有的性能。
但是早上看到的情况完全不一样,系统负载降到不到5.0,CPU使用率也不到10%,然后登录到swingbench工具的窗口,看到了非常奇怪的一幕,swingbench工具并没有出现hung死的现象,但是swingbench中反应业务 TPS 和 TPM 的曲线统统没有数据,反应数据库延迟的Response相应时间的曲线变得特别高(响应时间大概在13w-15w的样子)。
随后登录了数据库实例所在的节点,查看了数据库的状态和ASM磁盘组的状态,数据库实例和ASM实例的状态都是正常的,而且ASM实例没有发生掉盘事件。随后对事件进行了分析,以下是分析过程:
操作系统层面
一般应用程序出现问题后,或多或少和离不开操作系统的关系。因此,我也是首先查看了操作系统有没有发生过重启,通过 uptime
查看到数据库实例所在的节点已经运行了好几天,显然,并不是因为操作系统重启导致的;
# uptime
09:44:10 up 3 days, 49 min, 1 user, load average: 0.570, 0.510, 1.29
紧接着查看了内存使用情况,早上查看到的内存因为数据库已经没有压力,内存也释放了一部分,从目前的使用率来讲看不出来什么;但是考虑到应用程序在服务过程汇总可能会发生内存泄漏的问题,我有查看了系统是否发生过crash(一般应用程序占据大量内存时,操作系统内存不够用的情况下会触发操作系统crash来强制中断消耗内存的进程),通过 ls /var/crash/
查看是否存在前天晚上的crash日志,然而也没有crash日志,说明内存是可用的,并不是因为内存引起的数据库延迟响应高
# free -lh
total used free shared buff/cache available
Mem: 124G 76G 32G 7.2G 16G 35G
Low: 124G 96G 28G
High: 0B 0B 0B
Swap: 15G 0B 15G
# ls /var/crash/
随后查看了系统空间使用率,安装Orale数据库软件的目录空间也是正常的
# df -lh
Filesystem Size Used Avail Use% Mounted on
...
/dev/rhel/rhel-opt 100G 78G 23G 78% /opt
...
数据库层面
操作系统上检查了CPU、内存、目录空间外,并没有发现有什么异常。既然操作系统上没有出现故障,那么数据库响应延迟的问题很可能出现在数据库上了。以下是数据库上的分析过程。
首先还是查看数据库状态和ASM磁盘组资源状态,看起来没什么问题
#crsctl stat res -t
...
ora.rac.db
1 ONLINE ONLINE node1 Open
2 ONLINE ONLINE node2 Open
# asmcmd lsdg
State Type Rebal Sector Logical_Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL N 512 512 4096 1048576 614400 651 204800 -102074 0 N LOGDG/
MOUNTED NORMAL N 512 512 4096 1048576 31561244 377434 10767954 -5195260 0 N DATADG/
MOUNTED NORMAL N 512 512 4096 4194304 18432 17224 6144 5540 0 Y OCRVOTE/
随后使用oracle用户登录,查看数据库下的告警日志(alert.log),日志中从前天晚上8点48左右开始出现报错,大致意思是LOGDG对应的磁盘组没有剩余可用空间了,空间耗尽了。
2022-03-10T20:37:59.178535+08:00
Thread 1 advanced to log sequence 4315 (LGWR switch), current SCN: 27195258674
Current log# 10 seq# 4315 mem# 0: +LOGDG/RAC/ONLINELOG/group_10.265.1095247821
2022-03-10T20:40:00.952685+08:00
ARC3 (PID:2830): Archived Log entry 21 added for T-1.S-4314 ID 0x9fed2859 LAD:1
2022-03-10T20:48:24.680608+08:00
Thread 1 advanced to log sequence 4316 (LGWR switch), current SCN: 27204160847
Current log# 11 seq# 4316 mem# 0: +LOGDG/RAC/ONLINELOG/group_11.266.1095247855
2022-03-10T20:48:25.734541+08:00
ARC0 (PID:2822): Unable to create archive log file '+LOGDG'
2022-03-10T20:48:25.759606+08:00
Errors in file /opt/oracle/diag/rdbms/rac/rac1/trace/rac1_arc0_2822.trc:
ORA-19504: failed to create file "+LOGDG"
ORA-17502: ksfdcre:4 Failed to create file +LOGDG
ORA-15041: diskgroup "LOGDG" space exhausted
ARC0 (PID:2822): Error 19504 Creating archive log file to '+LOGDG'
ARC0 (PID:2822): Stuck archiver: insufficient local LADs
ARC0 (PID:2822): Stuck archiver condition declared
2022-03-10T20:48:25.905755+08:00
Errors in file /opt/oracle/diag/rdbms/rac/rac1/trace/rac1_arc0_2822.trc:
ORA-16038: log 10 sequence# 4315 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 10 thread 1: '+LOGDG/RAC/ONLINELOG/group_10.265.1095247821'
ARC0 (PID:2822): Archival error occurred on a closed thread, archiver continuing
2022-03-10T20:48:25.905894+08:00
ORACLE Instance rac1, archival error, archiver continuing
并且在数据库的alert告警日志中看到了一些 Unable to create archive log file 的打印,这里就有点奇怪了,这套数据库之前使用的过程中并没有开启过归档,而日志中的打印确实是创建归档文件失败。
尽管如此,我还是先看了ASM磁盘组的状态。再次看ASM磁盘组时,果然发现了一些问题,LOGDG对应的磁盘组可用空间(Usable_file_MB)变成负的了
# asmcmd lsdg
State Type Rebal Sector Logical_Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL N 512 512 4096 1048576 614400 651 204800 -102074 0 N LOGDG/
MOUNTED NORMAL N 512 512 4096 1048576 31561244 377434 10767954 -5195260 0 N DATADG/
MOUNTED NORMAL N 512 512 4096 4194304 18432 17224 6144 5540 0 Y OCRVOTE/
这个时候我并不清楚LOGDG对应的ASM磁盘组空间为什么会变成负的,重新以oracle用户登录进入数据库中,查看redo日志组的使用情况,大部分的redo日志组处于非活跃状态(INACTIVE),以为是因为redo日志暴增导致的磁盘组空间溢出,因此准备删除几组redo日志。根据正常情况下,Oracle数据库中的INACTIVE状态的redo log日志文件是可以被删除的,但是这次删除的时候,却报错了。
> set linesize 2000
> select * from v$log;
GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME NEXT_CHANGE# NEXT_TIME CON_ID
---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------------- ------------ ------------------- ----------
6 1 4320 1.0737E+10 512 1 NO INACTIVE 2.7240E+10 2022-03-10 21:30:25 2.7249E+10 2022-03-10 21:40:56 0
7 1 4321 1.0737E+10 512 1 NO INACTIVE 2.7249E+10 2022-03-10 21:40:56 2.7258E+10 2022-03-10 21:51:22 0
8 1 4322 1.0737E+10 512 1 NO INACTIVE 2.7258E+10 2022-03-10 21:51:22 2.7267E+10 2022-03-10 22:01:42 0
9 1 4323 1.0737E+10 512 1 NO CURRENT 2.7267E+10 2022-03-10 22:01:42 9.2954E+18 0
10 1 4315 1.0737E+10 512 1 NO INACTIVE 2.7195E+10 2022-03-10 20:37:59 2.7204E+10 2022-03-10 20:48:24 0
11 1 4316 1.0737E+10 512 1 NO INACTIVE 2.7204E+10 2022-03-10 20:48:24 2.7213E+10 2022-03-10 20:58:59 0
12 1 4317 1.0737E+10 512 1 NO INACTIVE 2.7213E+10 2022-03-10 20:58:59 2.7222E+10 2022-03-10 21:09:37 0
13 1 4318 1.0737E+10 512 1 NO INACTIVE 2.7222E+10 2022-03-10 21:09:37 2.7231E+10 2022-03-10 21:20:02 0
14 1 4319 1.0737E+10 512 1 NO INACTIVE 2.7231E+10 2022-03-10 21:20:02 2.7240E+10 2022-03-10 21:30:25 0
16 2 4476 1.0737E+10 512 1 NO CURRENT 2.7259E+10 2022-03-10 21:52:04 9.2954E+18 0
17 2 4468 1.0737E+10 512 1 NO INACTIVE 2.7187E+10 2022-03-10 20:27:59 2.7196E+10 2022-03-10 20:38:25 0
18 2 4469 1.0737E+10 512 1 NO INACTIVE 2.7196E+10 2022-03-10 20:38:25 2.7205E+10 2022-03-10 20:49:06 0
19 2 4470 1.0737E+10 512 1 NO INACTIVE 2.7205E+10 2022-03-10 20:49:06 2.7214E+10 2022-03-10 20:59:56 0
20 2 4471 1.0737E+10 512 1 NO INACTIVE 2.7214E+10 2022-03-10 20:59:56 2.7223E+10 2022-03-10 21:10:21 0
21 2 4472 1.0737E+10 512 1 NO INACTIVE 2.7223E+10 2022-03-10 21:10:21 2.7232E+10 2022-03-10 21:20:47 0
22 2 4473 1.0737E+10 512 1 NO INACTIVE 2.7232E+10 2022-03-10 21:20:47 2.7241E+10 2022-03-10 21:31:12 0
23 2 4474 1.0737E+10 512 1 NO INACTIVE 2.7241E+10 2022-03-10 21:31:12 2.7250E+10 2022-03-10 21:41:35 0
24 2 4475 1.0737E+10 512 1 NO INACTIVE 2.7250E+10 2022-03-10 21:41:35 2.7259E+10 2022-03-10 21:52:04 0
根据删除时候提示的报错,redo日志需要归档才能删除。也就是说,虽然现在数据库中的redo log日志文件已经处于INACTIVE状态,但是并没有经过归档,此时日志文件是无法删除的。
> alter database drop logfile group 6;
alter database drop logfile group 6
*
ERROR at line 1:
ORA-00350: log 6 of instance rac1 (thread 1) needs to be archived
ORA-00312: online log 6 thread 1: '+LOGDG/RAC/ONLINELOG/group_6.261.1095247683'
> alter database drop logfile group 24;
alter database drop logfile group 24
*
ERROR at line 1:
ORA-00350: log 24 of instance rac2 (thread 2) needs to be archived
ORA-00312: online log 24 thread 2: '+LOGDG/RAC/ONLINELOG/group_24.279.1095248311'
果然,和数据库告警日志中的报错重合了,是因为数据库归档出的问题(这个时候才反应过来,一方面是因为归档不是我开的,所以没往那想)。在数据库中查看归档的配置,没有问题,数据库确实开了归档(log_archive_dest_1),而且保存归档的路径正好是LOGDG。
LOGDG磁盘组是为了存放redo日志创建的一个磁盘组,本身的容量并不是很大,加上之前redo文件已经消耗了一部分空间,剩余的可用空间本来也不多,这下开了数据库归档,更是雪上加霜了,直接导致磁盘组空间溢出了。
这里也能解释为什么数据库状态正常的情况下,数据库的延迟响应会特别高。存放redo日志的磁盘组本身空间已经溢出,导致新产生的数据库事务对应的redo日志无法写入到对应的磁盘组中(磁盘组空间炸了),而根据Oracle的redo机制(日志先行)写不下去日志的情况下,应该是触发了数据库自身的保护机制,将业务hung住,提高数据库的响应时间来保证事务所产生的日志能够被记录上。猜想是写不下去日志就不让新事务产生
09:58:17 sys@ rac1>show parameter log_archive
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_config string
log_archive_dest string
log_archive_dest_1 string location=+LOGDG
log_archive_dest_2 string
log_archive_dest_3 string
...
log_archive_dest_state_1 string enable
log_archive_dest_state_2 string enable
log_archive_dest_state_3 string enable
查看数据库redo日志的归档情况,发现只有两组日志是归档的,其他的redo日志组都是没有归档过的
> select * from v$archive;
GROUP# THREAD# SEQUENCE# ISC CUR FIRST_CHANGE# CON_ID
---------- ---------- ---------- --- --- ------------- ----------
6 1 4320 NO NO 2.7240E+10 0
7 1 4321 NO NO 2.7249E+10 0
8 1 4322 NO NO 2.7258E+10 0
9 1 4323 YES YES 2.7267E+10 0
10 1 4315 NO NO 2.7195E+10 0
11 1 4316 NO NO 2.7204E+10 0
12 1 4317 NO NO 2.7213E+10 0
13 1 4318 NO NO 2.7222E+10 0
14 1 4319 NO NO 2.7231E+10 0
16 2 4476 YES YES 2.7259E+10 0
17 2 4468 NO NO 2.7187E+10 0
18 2 4469 NO NO 2.7196E+10 0
19 2 4470 NO NO 2.7205E+10 0
20 2 4471 NO NO 2.7214E+10 0
21 2 4472 NO NO 2.7223E+10 0
22 2 4473 NO NO 2.7232E+10 0
23 2 4474 NO NO 2.7241E+10 0
24 2 4475 NO NO 2.7250E+10 0
显然,问题原因已经找到了。剩下的就是数据库的恢复工作了。
数据库恢复
既然是因为归档日志导致的空间溢出,那么首要的就是删除归档文件,关闭归档,两者没有先后之分。
进入ASM磁盘组中,GI中提供了asmcmd命令行工具对asm磁盘组进行管理,找到ARCHIVELOG目录,删除其下的归档日志文件,释放磁盘组空间
# asmcmd
ASMCMD> ls
LOGDG/
DATADG/
OCRVOTE/
ASMCMD> cd LOGDG/
ASMCMD> ls
ASM/
RAC/
ASMCMD> cd RAC/
ASMCMD> ls
ARCHIVELOG/
ONLINELOG/
redo01.log
redo02.log
redo03.log
redo04.log
ASMCMD> cd ARCHIVELOG/
ASMCMD> ls
2022_03_09/
2022_03_10/
ASMCMD> cd 2022_03_09/
ASMCMD> ls
thread_1_seq_4224.260.1098891701
thread_1_seq_4225.280.1098893501
thread_1_seq_4235.283.1098897523
thread_1_seq_4237.285.1098898563
thread_2_seq_4379.270.1098891701
thread_2_seq_4380.281.1098893501
thread_2_seq_4387.282.1098897519
thread_2_seq_4390.284.1098898561
ASMCMD> rm thread_1_seq_4224.260.1098891701
ASMCMD> rm thread_1_seq_4225.280.1098893501
ASMCMD> rm thread_1_seq_4235.283.1098897523
ASMCMD> rm thread_1_seq_4237.285.1098898563
ASMCMD> rm thread_2_seq_4379.270.1098891701
ASMCMD> rm thread_2_seq_4380.281.1098893501
ASMCMD> rm thread_2_seq_4387.282.1098897519
ASMCMD> rm thread_2_seq_4390.284.1098898561
ASMCMD> ls
ASMCMD-8002: entry '2022_03_09' does not exist in directory '+LOGDG/RAC/ARCHIVELOG/'
ASMCMD> cd ..
ASMCMD> ls
2022_03_10/
ASMCMD> cd 2022_03_10/
ASMCMD> ls
thread_1_seq_4308.286.1098992135
thread_1_seq_4309.288.1098992773
thread_1_seq_4310.290.1098993385
thread_1_seq_4311.292.1098993999
thread_1_seq_4312.294.1098994625
thread_1_seq_4313.296.1098995251
thread_1_seq_4314.298.1098995879
thread_2_seq_4462.287.1098992135
thread_2_seq_4463.289.1098992777
thread_2_seq_4464.291.1098993405
thread_2_seq_4465.293.1098994031
thread_2_seq_4466.295.1098994657
thread_2_seq_4467.297.1098995279
ASMCMD> ls -lh
Type Redund Striped Time Sys Name
ARCHIVELOG MIRROR COARSE MAR 10 19:00:00 Y thread_1_seq_4308.286.1098992135
ARCHIVELOG MIRROR COARSE MAR 10 19:00:00 Y thread_1_seq_4309.288.1098992773
ARCHIVELOG MIRROR COARSE MAR 10 19:00:00 Y thread_1_seq_4310.290.1098993385
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_1_seq_4311.292.1098993999
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_1_seq_4312.294.1098994625
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_1_seq_4313.296.1098995251
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_1_seq_4314.298.1098995879
ARCHIVELOG MIRROR COARSE MAR 10 19:00:00 Y thread_2_seq_4462.287.1098992135
ARCHIVELOG MIRROR COARSE MAR 10 19:00:00 Y thread_2_seq_4463.289.1098992777
ARCHIVELOG MIRROR COARSE MAR 10 19:00:00 Y thread_2_seq_4464.291.1098993405
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_2_seq_4465.293.1098994031
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_2_seq_4466.295.1098994657
ARCHIVELOG MIRROR COARSE MAR 10 20:00:00 Y thread_2_seq_4467.297.1098995279
ASMCMD> rm thread_1_seq_4308.286.1098992135
ASMCMD> rm thread_1_seq_4309.288.1098992773
ASMCMD> rm thread_1_seq_4310.290.1098993385
ASMCMD> rm thread_1_seq_4311.292.1098993999
ASMCMD> rm thread_1_seq_4313.296.1098995251
ASMCMD> rm thread_1_seq_4312.294.1098994625
ASMCMD> rm thread_1_seq_4314.298.1098995879
ASMCMD> rm thread_2_seq_4463.289.1098992777
ASMCMD> rm thread_2_seq_4464.291.1098993405
ASMCMD> rm thread_2_seq_4465.293.1098994031
ASMCMD> rm thread_2_seq_4466.295.1098994657
ASMCMD> rm thread_2_seq_4467.297.1098995279
ASMCMD> rm thread_2_seq_4462.287.1098992135
ASMCMD> exit
# asmcmd lsdg
State Type Rebal Sector Logical_Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL N 512 512 4096 1048576 614400 206870 204800 1035 0 N LOGDG/
MOUNTED NORMAL N 512 512 4096 1048576 31561244 377434 10767954 -5195260 0 N DATADG/
MOUNTED NORMAL N 512 512 4096 4194304 18432 17224 6144 5540 0 Y OCRVOTE/
删除掉数据库的归档日志后,再次查看磁盘组,可以看到LOGDG对应的磁盘组可用空间被释放出来了。接着关掉数据库的归档(不关归档估计撑不过一天磁盘组又满了),关闭数据库,将数据库启动到mount状态(所有的数据库实例节点),使用 alter database noarchivelog;
关闭数据库归档
删除数据库归档需要在数据库状态为mount状态下进行,open状态下的数据库无法完成归档开启/关闭操作
> archive log list ;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination +LOGDG
Oldest online log sequence 4315
Next log sequence to archive 4315
Current log sequence 4323
> alter database noarchivelog;
alter database noarchivelog
*
ERROR at line 1:
ORA-01126: database must be mounted in this instance and not open in any instance
> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
> startup mount
ORACLE instance started.
Total System Global Area 5.3687E+10 bytes
Fixed Size 37223248 bytes
Variable Size 9126805504 bytes
Database Buffers 4.4426E+10 bytes
Redo Buffers 96993280 bytes
Database mounted.
> alter database noarchivelog;
Database altered.
> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination +LOGDG
Oldest online log sequence 4316
Current log sequence 4324
关闭归档后,将数据库启动到Open状态
> alter database open;
Database altered.
至此,数据库恢复正常,通过swingbench曲线看到表示数据库事务的TPS 和 TPM数据曲线恢复正常。
参考:数据库关闭归档部分参考 Oracle数据库开启与关闭归档模式