9i RAC split-brain leading to a node restart

A colleague sent over some traces today from a case where one node of a 9i RAC restarted. I took a rough look at them; the analysis is below.

First, let's look at node 1's alert log:
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 19)
List of nodes:
 0
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 20)
We can see that at 11:06:34 the cluster starts evicting instance 2. Next, let's look at node 2's information:
Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving  log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/siebdb/bdump/siebdb2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
Tue Oct 18 11:03:45 2011

Tue Oct 18 11:07:56 2011
Reconfiguration started (old inc 0, new inc 21)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS remote-cache requests
Tue Oct 18 11:08:19 2011
 0 write requests issued in 1511603 GCS resources
 436 PIs marked suspect, 0 flush PI msgs
Tue Oct 18 11:08:22 2011
Reconfiguration complete
From node 2's alert log we can derive the following timeline (a small parsing sketch follows the list):

10:51:57 --> IPC send timeout
10:51:59 --> communications failure; the log reports "Communications reconfiguration: instance 0" (the 0 here is the member number of the remote instance, not a shutdown flag)
10:53:45 --> split-brain detected; waiting for clusterware split-brain resolution
11:02:14 --> timed out trying to start process P001 (from this we can infer that ospid 22239 above was probably the P001 process)
11:03:45 --> instance terminated, followed by the ORA-00481 error
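
Incidentally, this kind of timeline can also be pulled out of the alert log mechanically, by pairing each timestamp line with the message that follows it. A minimal Python sketch (the log file name is only a placeholder):

import re
from datetime import datetime

TS_RE = re.compile(r"^\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
TS_FMT = "%a %b %d %H:%M:%S %Y"

def timeline(alert_log_path):
    """Yield (timestamp, message) pairs: each 'Tue Oct 18 ...' line in the
    alert log is paired with the first non-empty line that follows it."""
    ts = None
    with open(alert_log_path) as f:
        for raw in f:
            line = raw.rstrip("\n")
            if TS_RE.match(line):
                ts = datetime.strptime(line, TS_FMT)
            elif ts is not None and line.strip():
                yield ts, line.strip()
                ts = None

# Usage: for when, msg in timeline("siebdb2_alert.log"): print(when, msg)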

From the analysis so far we can conclude that the split-brain is what brought instance 2 down. So what caused the split-brain in the first place?

Finally, let's look at the trace of the LMON process mentioned in the alert log:
*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10  Senders 10
GES IPC: Buffers  Receive 1000  Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size  Regular 440  Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2011-10-18 07:09:01.649
 Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
 Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
 Name Service recovery started
 Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
 Multicasted all local name entries for publish
 Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
 Name Service normal
 Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec  ====> synchronization timeout of 660 s
List of nodes:
 0 1                                       ====> cluster node numbers
 Global Resource Directory frozen
node 0
release 9 2 0 7                            ====> oracle version
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
 Total master weight = 6244032
 Dead  inst
 Join  inst 0 1
 Exist inst
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
*** 2011-10-18 07:09:11.515
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS cache requests
 0 write requests issued in 1518892 GCS resources
 10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470                   ====> this timestamp is important
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3   ====> communications failure
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
 Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311                   ====> split-brain appears
kjxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information                         ====> IMR is triggered
 Member 1, thread 2, state 4, flags 0x00a1
 RR seq 19, propstate 3, pending propstate 0
 rcfg rsn 3, rcfg time 21546700, mem ct 2
 master 1, master rcfg time 21546700

 Member information:
 Member 0, incarn 18, version 1840361946
 thrd 1, prev thrd 65535, status 0x0007, err 0x0002
 valct 0
 Member 1, incarn 18, version 816352
 thrd 2, prev thrd 65535, status 0x0107, err 0x0000
 valct 2

Group name: SIEBDB
Member id: 1
Cached SKGXN event: 0
Group State:
 State: 18 1
 Commited Map: 0 1
 New Map: 0 1
 SKGXN Map: 0 1
 Master node: 0
 Memcnt 2  Rcvcnt 0
 Substate Proposal: false
Inc Proposal:
 incarn 0  memcnt 0  master 0
 proposal false  matched false
 map:
Master Inc State:
 incarn 0  memcnt 0  agrees 0  flag 0x1
 wmap:
 nmap:
 ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19
Looking at the timestamps as a whole, the entire episode, from the split-brain being triggered to its handling being complete, took roughly 15 minutes, from about 10:51 to 11:07.
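
For reference, the exact gaps can be worked out from the timestamps quoted above. A quick check (a minimal Python sketch using only figures taken from the logs) shows that the wait for split-brain resolution lasted exactly 600 seconds, and that the whole episode on node 2, from the IPC send timeout to "Reconfiguration complete" at 11:08:22, took a bit over 16 minutes:

from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"
t = lambda s: datetime.strptime(s, FMT)

ipc_timeout   = t("Tue Oct 18 10:51:57 2011")   # IPC Send timeout detected (node 2)
split_wait    = t("Tue Oct 18 10:53:45 2011")   # waiting for clusterware split-brain resolution
evicted       = t("Tue Oct 18 11:03:45 2011")   # voting results / evicted by member 1
rcfg_complete = t("Tue Oct 18 11:08:22 2011")   # reconfiguration complete after the restart

print((evicted - split_wait).total_seconds())              # 600.0  -> exactly 10 minutes of waiting
print((rcfg_complete - ipc_timeout).total_seconds() / 60)  # ~16.4  -> the whole episode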

The reconfiguration reason codes have the following meanings:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

If the 660 s seen earlier were really the heartbeat timeout threshold, this failure should not have happened.
According to my colleague, the clocks of the RAC cluster nodes differed by about 3 minutes; in the early hours of that day, the hardware vendor had been on site replacing batteries.

Seen in that light, we can basically conclude that the split-brain was caused by the nodes' clocks being out of sync. We know that when third-party cluster software (an HA suite) is in use, I/O fencing is handled by that software, so I still don't know exactly what the threshold here is.

In other words, after how many seconds out of step does a split-brain get declared? Could it really be the 660 s mentioned earlier?
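
Whatever the exact threshold turns out to be, the clock skew itself is easy to verify. A rough sketch for comparing the clocks of the two nodes (the host names rac1/rac2 are placeholders, it assumes password-less ssh, and it ignores ssh round-trip latency; in practice an NTP check such as ntpq -p is the proper tool):

import subprocess
import time

NODES = ["rac1", "rac2"]   # placeholder host names for the two RAC nodes

def remote_epoch(host):
    """Return (local time when the call was made, remote clock in epoch seconds)."""
    t0 = time.time()
    out = subprocess.run(["ssh", host, "date", "+%s"],
                         capture_output=True, text=True, check=True).stdout.strip()
    return t0, int(out)

base_t0, base_epoch = remote_epoch(NODES[0])
for node in NODES[1:]:
    t0, epoch = remote_epoch(node)
    # skew relative to the first node, corrected for the time elapsed between the two calls
    skew = (epoch - base_epoch) - (t0 - base_t0)
    print("%s vs %s: ~%.0f s clock skew" % (node, NODES[0], skew))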

Finally, a supplementary note: LMON mainly checks the following two kinds of heartbeat (a conceptual sketch follows the list):

  1. Network heartbeat — plainly speaking, the cluster heartbeat (presumably implemented via a ping mechanism).
  2. Controlfile heartbeat — the CKPT process updates the controlfile every 3 seconds, and this is used as the liveness check.
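
Purely as a conceptual illustration of this (it is not Oracle's actual implementation, and the tolerance value is made up), the logic boils down to remembering when each kind of heartbeat was last seen and flagging whichever one has gone stale:

import time

# As noted above, CKPT refreshes the controlfile heartbeat roughly every 3 seconds.
MISS_TOLERANCE = 30   # made-up value: how stale a heartbeat may get before we flag it

last_seen = {"network": time.time(), "controlfile": time.time()}

def record(kind):
    """Call this whenever a heartbeat of the given kind is observed
    (a network ping answered, or a controlfile heartbeat update seen)."""
    last_seen[kind] = time.time()

def stale_heartbeats():
    """Return the heartbeat kinds whose last update is older than the tolerance."""
    now = time.time()
    return [k for k, t in last_seen.items() if now - t > MISS_TOLERANCE]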


