9i RAC split-brain leading to a node restart

A colleague sent over some traces today from a case where one node of a 9i RAC restarted. I took a rough look at them; the analysis is below.

First, let's look at node 1's alert log:
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 19)
List of nodes:
 0
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 20)
We can see that at 11:06:34 the cluster starts evicting instance 2. Next, let's look at node 2's information:
Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving  log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/siebdb/bdump/siebdb2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
Tue Oct 18 11:03:45 2011

Tue Oct 18 11:07:56 2011
Reconfiguration started (old inc 0, new inc 21)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS remote-cache requests
Tue Oct 18 11:08:19 2011
 0 write requests issued in 1511603 GCS resources
 436 PIs marked suspect, 0 flush PI msgs
Tue Oct 18 11:08:22 2011
Reconfiguration complete
From node 2's alert log we can derive the following timeline (a small parsing sketch follows the list):

10:51:57 --> IPC send timeout
10:51:59 --> communications failure; the log reports "Communications reconfiguration: instance 0" (the 0 here is the member number of the remote instance, not a shutdown flag)
10:53:45 --> split-brain detected; waiting for clusterware split-brain resolution
11:02:14 --> timed out trying to start process P001 (from this we can infer that ospid 22239 above was probably the P001 process)
11:03:45 --> instance terminated, followed by the ORA-00481 error
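
Incidentally, this kind of timeline can also be pulled out of the alert log mechanically, by pairing each timestamp line with the message that follows it. A minimal Python sketch (the log file name is only a placeholder):

import re
from datetime import datetime

TS_RE = re.compile(r"^\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
TS_FMT = "%a %b %d %H:%M:%S %Y"

def timeline(alert_log_path):
    """Yield (timestamp, message) pairs: each 'Tue Oct 18 ...' line in the
    alert log is paired with the first non-empty line that follows it."""
    ts = None
    with open(alert_log_path) as f:
        for raw in f:
            line = raw.rstrip("\n")
            if TS_RE.match(line):
                ts = datetime.strptime(line, TS_FMT)
            elif ts is not None and line.strip():
                yield ts, line.strip()
                ts = None

# Usage: for when, msg in timeline("siebdb2_alert.log"): print(when, msg)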

From the analysis so far we can conclude that the split-brain is what brought instance 2 down. So what caused the split-brain in the first place?

Finally, let's look at the trace of the LMON process mentioned in the alert log:
*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10  Senders 10
GES IPC: Buffers  Receive 1000  Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size  Regular 440  Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2011-10-18 07:09:01.649
 Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
 Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
 Name Service recovery started
 Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
 Multicasted all local name entries for publish
 Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
 Name Service normal
 Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec  ====> synchronization timeout of 660 s
List of nodes:
 0 1                                       ====> cluster node numbers
 Global Resource Directory frozen
node 0
release 9 2 0 7                            ====> oracle version
node 1
release 9 2 0 7
res_master_weight for node 0 is 3122016
res_master_weight for node 1 is 3122016
 Total master weight = 6244032
 Dead  inst
 Join  inst 0 1
 Exist inst
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
*** 2011-10-18 07:09:11.515
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS cache requests
 0 write requests issued in 1518892 GCS resources
 10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470                   ====> this timestamp is important
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3   ====> communications failure
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
 Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311                   ====> split-brain appears
kjxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information                         ====> IMR is triggered
 Member 1, thread 2, state 4, flags 0x00a1
 RR seq 19, propstate 3, pending propstate 0
 rcfg rsn 3, rcfg time 21546700, mem ct 2
 master 1, master rcfg time 21546700

 Member information:
 Member 0, incarn 18, version 1840361946
 thrd 1, prev thrd 65535, status 0x0007, err 0x0002
 valct 0
 Member 1, incarn 18, version 816352
 thrd 2, prev thrd 65535, status 0x0107, err 0x0000
 valct 2

Group name: SIEBDB
Member id: 1
Cached SKGXN event: 0
Group State:
 State: 18 1
 Commited Map: 0 1
 New Map: 0 1
 SKGXN Map: 0 1
 Master node: 0
 Memcnt 2  Rcvcnt 0
 Substate Proposal: false
Inc Proposal:
 incarn 0  memcnt 0  master 0
 proposal false  matched false
 map:
Master Inc State:
 incarn 0  memcnt 0  agrees 0  flag 0x1
 wmap:
 nmap:
 ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19
Looking at the timestamps as a whole, the entire episode, from the split-brain being triggered to its handling being complete, took roughly 15 minutes, from about 10:51 to 11:07.
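
For reference, the exact gaps can be worked out from the timestamps quoted above. A quick check (a minimal Python sketch using only figures taken from the logs) shows that the wait for split-brain resolution lasted exactly 600 seconds, and that the whole episode on node 2, from the IPC send timeout to "Reconfiguration complete" at 11:08:22, took a bit over 16 minutes:

from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"
t = lambda s: datetime.strptime(s, FMT)

ipc_timeout   = t("Tue Oct 18 10:51:57 2011")   # IPC Send timeout detected (node 2)
split_wait    = t("Tue Oct 18 10:53:45 2011")   # waiting for clusterware split-brain resolution
evicted       = t("Tue Oct 18 11:03:45 2011")   # voting results / evicted by member 1
rcfg_complete = t("Tue Oct 18 11:08:22 2011")   # reconfiguration complete after the restart

print((evicted - split_wait).total_seconds())              # 600.0  -> exactly 10 minutes of waiting
print((rcfg_complete - ipc_timeout).total_seconds() / 60)  # ~16.4  -> the whole episode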

The reconfiguration reason codes have the following meanings:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

If the 660 s seen earlier were really the heartbeat timeout threshold, this failure should not have happened.
According to my colleague, the clocks of the RAC cluster nodes differed by about 3 minutes; in the early hours of that day, the hardware vendor had been on site replacing batteries.

Seen in that light, we can basically conclude that the split-brain was caused by the nodes' clocks being out of sync. We know that when third-party cluster software (an HA suite) is in use, I/O fencing is handled by that software, so I still don't know exactly what the threshold here is.

In other words, after how many seconds out of step does a split-brain get declared? Could it really be the 660 s mentioned earlier?
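
Whatever the exact threshold turns out to be, the clock skew itself is easy to verify. A rough sketch for comparing the clocks of the two nodes (the host names rac1/rac2 are placeholders, it assumes password-less ssh, and it ignores ssh round-trip latency; in practice an NTP check such as ntpq -p is the proper tool):

import subprocess
import time

NODES = ["rac1", "rac2"]   # placeholder host names for the two RAC nodes

def remote_epoch(host):
    """Return (local time when the call was made, remote clock in epoch seconds)."""
    t0 = time.time()
    out = subprocess.run(["ssh", host, "date", "+%s"],
                         capture_output=True, text=True, check=True).stdout.strip()
    return t0, int(out)

base_t0, base_epoch = remote_epoch(NODES[0])
for node in NODES[1:]:
    t0, epoch = remote_epoch(node)
    # skew relative to the first node, corrected for the time elapsed between the two calls
    skew = (epoch - base_epoch) - (t0 - base_t0)
    print("%s vs %s: ~%.0f s clock skew" % (node, NODES[0], skew))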

Finally, a supplementary note: LMON mainly checks the following two kinds of heartbeat (a conceptual sketch follows the list):

  1. Network heartbeat — plainly speaking, the cluster heartbeat (presumably implemented via a ping mechanism).
  2. Controlfile heartbeat — the CKPT process updates the controlfile every 3 seconds, and this is used as the liveness check.
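
Purely as a conceptual illustration of this (it is not Oracle's actual implementation, and the tolerance value is made up), the logic boils down to remembering when each kind of heartbeat was last seen and flagging whichever one has gone stale:

import time

# As noted above, CKPT refreshes the controlfile heartbeat roughly every 3 seconds.
MISS_TOLERANCE = 30   # made-up value: how stale a heartbeat may get before we flag it

last_seen = {"network": time.time(), "controlfile": time.time()}

def record(kind):
    """Call this whenever a heartbeat of the given kind is observed
    (a network ping answered, or a controlfile heartbeat update seen)."""
    last_seen[kind] = time.time()

def stale_heartbeats():
    """Return the heartbeat kinds whose last update is older than the tolerance."""
    now = time.time()
    return [k for k, t in last_seen.items() if now - t > MISS_TOLERANCE]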


