Troubleshooting: instance evicted due to ORA error 29740

This case comes from a customer in the western region, running a two-node Oracle RAC 19.16 cluster. The database instance on node 2 was evicted with ORA-29740.

Since node 2 was the one evicted, we start with node 2's alert log:

2023-11-06T22:24:38.077225+08:00
Detected an inconsistent instance membership by instance 1
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc  (incident=4320189):
ORA-29740: evicted by instance number 1, group incarnation 13
Incident details in: /oracle/app/oracle/diag/rdbms/racdb/racdb2/incident/incdir_4320189/racdb2_lmon_363356_i4320189.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2023-11-06T22:24:38.976251+08:00
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc:
ORA-29740: evicted by instance number 1, group incarnation 13
Errors in file /oracle/app/oracle/diag/rdbms/racdb/racdb2/trace/racdb2_lmon_363356.trc  (incident=4320190):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/oracle/diag/rdbms/racdb/racdb2/incident/incdir_4320190/racdb2_lmon_363356_i4320190.trc
2023-11-06T22:24:39.693332+08:00
LMON (ospid: 363356): terminating the instance due to ORA error 29740

At 22:24:38 node 2 was evicted by node 1, and LMON produced a trace file. I did not examine that trace, because instance eviction at the CGS (Cluster Group Services) layer is implemented by the LMON process itself. Since it was node 1 that evicted node 2, node 1's alert log must not be overlooked.
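Both nodes' alert logs can be scanned for the eviction-related messages mechanically. A minimal sketch, using an illustrative subset of patterns (the sample lines are taken from the excerpts in this case; a real triage script would match more message types):

```python
import re

# Messages that typically mark an ORA-29740 eviction sequence in a
# RAC alert log (illustrative subset, not an exhaustive list).
PATTERNS = [
    r"IPC Send timeout",
    r"Communications reconfiguration",
    r"Detected an inconsistent instance membership",
    r"Evicting instance \d+ from cluster",
    r"ORA-29740",
    r"terminating the instance",
]
REGEX = re.compile("|".join(PATTERNS))

def eviction_events(lines):
    """Return the alert-log lines that belong to the eviction sequence."""
    return [line.rstrip() for line in lines if REGEX.search(line)]

alert_excerpt = [
    "IPC Send timeout to 2.2 inc 11 for msg type 36 from opid 61\n",
    "Increasing priority of 5 RS\n",
    "Evicting instance 2 from cluster\n",
]
print(eviction_events(alert_excerpt))
```

Running this over both alert logs side by side gives the eviction timeline without scrolling through unrelated entries.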

Node 1 alert log:

2023-11-06T22:23:49.624237+08:00
IPC Send timeout to 2.2 inc 11 for msg type 36 from opid 61
2023-11-06T22:23:49.626416+08:00
Communications reconfiguration: instance_number 2 from SMON (ospid: 390100)
2023-11-06T22:24:38.081438+08:00
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2023-11-06T22:24:46.157136+08:00
Increasing priority of 5 RS
Reconfiguration started (old inc 11, new inc 15)

An IPC send timeout already appears at 22:23:49, and the communications reconfiguration was triggered in connection with SMON, so the SMON trace is the next place to look:

*** 2023-11-06T22:23:49.624060+08:00
IPCLW:[0.1]{-}[WAIT]:UTIL: [1699280629623930]  ACNH 0x7fb1108ee260 State: 3 SMSN: 1043354486 PKT(1043402479.1188389702) # Pending: 1
IPCLW:[0.2]{-}[WAIT]:UTIL: [1699280629623930]   Peer: LMS1.KSMSQ_dlm.363374 AckSeq: 1188389701
IPCLW:[0.3]{-}[WAIT]:UTIL: [1699280629623930]   Flags: 0x40000000 IHint: 0x501e46a00000019 THint: 0x5863c1f300000010
IPCLW:[0.4]{-}[WAIT]:UTIL: [1699280629623930]   Local Address: 169.254.10.192:33365 Remote Address: 169.254.5.139:18617
IPCLW:[0.5]{-}[WAIT]:UTIL: [1699280629623930]   Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 41 xdata2 c135a0d7
IPCLW:[0.6]{-}[WAIT]:UTIL: [1699280629623930]             : mmsz 8288 mmr 9200 mms 2 xdata 3a937d6
IPCLW:[0.7]{-}[WAIT]:UTIL: [1699280629623930]   IVPort: 40790 TVPort: 14294 IMPT: 21890 RMPT: 65   Pending Sends: No Unacked Sends: Yes
IPCLW:[0.8]{-}[WAIT]:UTIL: [1699280629623930]   Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 1 credits 7/8
IPCLW:[0.9]{-}[WAIT]:UTIL: [1699280629623930]   Unackd Messages 1043402478 -> 1043402478. SSEQ 1188389701 Send Time: 0:5:0.653.653784 SMSN # Xmits: 1 EMSN 0:5:0.653.653784
IPCLW:[0.10]{-}[WAIT]:UTIL: [1699280629623930]    [0] mbuf 0x7fb11073eaf0 MSN 1043402478 Seq 1188389701 -> 1188389702 Send Time: 0:5:0.653.653784 # XMits: 1
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: msg 0x342388d50 status 1.20, type 36, dest 2.2 seq 0.33419001
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: msg 0x342388d50 send failed inc 11 waited 300653848 usec
2023-11-06 22:23:49.624 : GSIPC:MSGQCB: dest_inc 11  sys_inc 11

A timeout occurred at the IPCLW layer, and its timestamp matches the IPC timeout in the alert log. The line `GSIPC:MSGQCB: msg 0x342388d50 send failed inc 11 waited 300653848 usec` shows the send had already waited about 300 seconds before failing.
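The `waited` value is reported in microseconds; converting it confirms the roughly 300-second send timeout:

```python
# GSIPC reports the accumulated wait in microseconds.
waited_usec = 300_653_848
waited_sec = waited_usec / 1_000_000
print(f"{waited_sec:.1f} s")  # prints "300.7 s" -- the ~300 s IPC send timeout
```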

Now look at SMON's activity in ASH:

select program, sql_exec_start, event, p1, p2, count(*)
  from dba_hist_active_sess_history
 where sample_time between to_date('2023-11-06 21:30:00', 'yyyy-mm-dd hh24:mi:ss')
                       and to_date('2023-11-06 23:30:00', 'yyyy-mm-dd hh24:mi:ss')
   and sql_id = '679x4qggryd2v'
 group by program, sql_exec_start, event, p1, p2

PROGRAM                 SQL_EXEC_START        EVENT            P1    P2        COUNT(*)
oracle@racdb1 (SMON)    2023/11/6 22:18:48    gc cr failure    3     121167    255
oracle@racdb1 (SMON)    2023/11/6 22:18:48    gc cr request    3     121167    68

SMON had been executing SQL 679x4qggryd2v since 22:18:48, continuously requesting the same remote CR block and accumulating a large number of `gc cr failure` waits. The span from 22:18:48 to 22:23:49 also matches the 300-second timeout reported in the trace.
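The interval between the SQL start time from ASH and the send failure in the trace can be checked directly:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
sql_start = datetime.strptime("2023-11-06 22:18:48", fmt)  # SQL_EXEC_START from ASH
send_fail = datetime.strptime("2023-11-06 22:23:49", fmt)  # GSIPC "send failed" time

elapsed = (send_fail - sql_start).total_seconds()
print(elapsed)  # prints 301.0 -- consistent with the ~300 s IPC send timeout
```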

The SQL text is:

SELECT smontabv.cnt
	,smontabv.time_mp
	,smontab.scn
	,smontab.num_mappings
	,smontab.tim_scn_map
FROM smon_scn_time smontab
	,(
		SELECT max(scn) scnmax
			,count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt
			,max(time_mp) time_mp
		FROM smon_scn_time
		) smontabv
WHERE smontab.scn = smontabv.scnmax

The statement is related to smon_scn_time maintenance, but that is unlikely to be the root cause; the primary suspect is still the private interconnect.

The OSWatcher traceroute data shows a large number of timeouts when tracing to the remote node:

zzz ***Mon Nov 6 22:18:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.024 ms  0.007 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.048 ms  0.037 ms  0.060 ms
zzz ***Mon Nov 6 22:18:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.035 ms  0.006 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.062 ms  0.094 ms  0.088 ms
zzz ***Mon Nov 6 22:19:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.030 ms  0.006 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.063 ms  0.056 ms *
zzz ***Mon Nov 6 22:19:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.031 ms  0.007 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.059 ms * *
zzz ***Mon Nov 6 22:20:24 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.037 ms  0.011 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  * racdb2-priv (10.10.15.12)  0.074 ms  0.049 ms
zzz ***Mon Nov 6 22:20:54 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.030 ms  0.005 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.063 ms * *
zzz ***Mon Nov 6 22:21:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.029 ms  0.006 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.049 ms  0.062 ms  0.089 ms
zzz ***Mon Nov 6 22:21:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.026 ms  0.006 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.113 ms * *
zzz ***Mon Nov 6 22:22:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.030 ms  0.008 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.062 ms  0.049 ms  0.045 ms
zzz ***Mon Nov 6 22:22:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.031 ms  0.006 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  * racdb2-priv (10.10.15.12)  0.044 ms  0.093 ms
zzz ***Mon Nov 6 22:23:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.027 ms  0.006 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.095 ms  0.078 ms  0.053 ms
zzz ***Mon Nov 6 22:23:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.024 ms  0.007 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  * racdb2-priv (10.10.15.12)  0.082 ms  0.084 ms
zzz ***Mon Nov 6 22:24:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.029 ms  0.006 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.057 ms  0.145 ms *
zzz ***Mon Nov 6 22:24:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.032 ms  0.006 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.098 ms  0.091 ms  0.085 ms
zzz ***Mon Nov 6 22:25:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.032 ms  0.006 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.058 ms  0.088 ms  0.041 ms
zzz ***Mon Nov 6 22:25:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.033 ms  0.007 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.056 ms  0.055 ms  0.049 ms
zzz ***Mon Nov 6 22:26:25 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.028 ms  0.006 ms  0.005 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.066 ms  0.059 ms  0.066 ms
zzz ***Mon Nov 6 22:26:55 CST 2023
traceroute to racdb1-priv (10.10.15.11), 30 hops max, 60 byte packets
 1  racdb1-priv (10.10.15.11)  0.038 ms  0.006 ms  0.006 ms
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.144 ms *  0.107 ms
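The intermittent `*` marks (lost probes) can be tallied per target to quantify the packet loss. A small sketch over text in the OSWatcher format above, with the sample data abbreviated to two snapshots:

```python
import re

def count_timeouts(oswatcher_text):
    """Count '*' (lost probe) marks per traceroute target."""
    counts = {}
    target = None
    for line in oswatcher_text.splitlines():
        m = re.match(r"traceroute to (\S+)", line)
        if m:
            # A new traceroute section starts; remember its target.
            target = m.group(1)
            counts.setdefault(target, 0)
        elif target and line.lstrip().startswith("1 "):
            # Hop-1 line: every '*' is a probe that timed out.
            counts[target] += line.count("*")
    return counts

sample = """\
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  racdb2-priv (10.10.15.12)  0.059 ms * *
traceroute to racdb2-priv (10.10.15.12), 30 hops max, 60 byte packets
 1  * racdb2-priv (10.10.15.12)  0.074 ms  0.049 ms
"""
print(count_timeouts(sample))  # prints {'racdb2-priv': 3}
```

On the full OSWatcher archive this makes the asymmetry obvious: racdb1-priv never loses a probe, while racdb2-priv loses them repeatedly.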

Nothing abnormal was found in the OS messages log. But the private network clearly has a problem: searching the alert logs shows IPC timeouts at many different points in time. The final recommendation was to check the switch, the fiber cables, and the network cards.

