Extended RAC ASM Recovery Case

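A few days ago, a customer's Oracle Extended RAC environment (an active-active deployment across two sites in the same city, 35 km apart) ran into trouble: an ASM diskgroup could not be mounted. Let's first look at the error.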

Fri Sep 25 00:31:57 2020
NOTE: GMON heartbeating for grp 2 (SOLDATA)
GMON querying group 2 at 5 for pid 27, osid 187323
Fri Sep 25 00:31:57 2020
NOTE: cache is mounting group SOLDATA created on 2019/04/12 15:10:32
NOTE: cache opening disk 0 of grp 2: SOLDATA_0000 path:/dev/emcpowerb
NOTE: group 2 (SOLDATA) high disk header ckpt advanced to fcn 0.714
NOTE: 09/25/20 00:31:57 SOLDATA.F1X0 found on disk 0 au 10 fcn 0.714 datfmt 2
NOTE: cache opening disk 1 of grp 2: SOLDATA_0001 path:/dev/emcpowerc
NOTE: cache opening disk 2 of grp 2: SOLDATA_0002 path:/dev/emcpowerd
NOTE: cache opening disk 3 of grp 2: SOLDATA_0003 path:/dev/emcpowerh
Fri Sep 25 00:31:57 2020
NOTE: cache mounting (first) external redundancy group 2/0xB82BB917 (SOLDATA)
Fri Sep 25 00:31:57 2020
* allocate domain 2, invalid = TRUE
kjbdomatt send to inst 2
Fri Sep 25 00:31:57 2020
NOTE: attached to recovery domain 2
Fri Sep 25 00:31:57 2020
NOTE: crash recovery of group SOLDATA will recover thread=1 ckpt=28.3507 domain=2 inc#=2 instnum=2
NOTE: crash recovery of group SOLDATA will recover thread=2 ckpt=39.8576 domain=2 inc#=4 instnum=1
NOTE: crash recovery of group SOLDATA will recover thread=3 ckpt=21.9043 domain=2 inc#=6 instnum=4
NOTE: crash recovery of group SOLDATA will recover thread=4 ckpt=22.6878 domain=2 inc#=12 instnum=3
* validated domain 2, flags = 0x0
NOTE: BWR validation signaled ORA-15096
Fri Sep 25 00:31:57 2020
Errors in file /u01/product/grid/crs/diag/asm/+asm/+ASM1/trace/+ASM1_ora_187323.trc:
ORA-15096: lost disk write detected
NOTE: crash recovery signalled OER-15096
ERROR: ORA-15096 signalled during mount of diskgroup SOLDATA

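The log shows that the failure occurred during the ASM crash recovery phase, with error ORA-15096. The message is explicit: a lost write was detected. A lost write usually leaves the metadata inconsistent. Looking further into the diag trace file referenced above: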
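When the alert log is long, it can help to pull out just the recovery threads and the ORA- codes. The following is a minimal sketch, not an Oracle tool: the function name `summarize` is hypothetical, the sample lines are copied from the excerpt above, and reading `ckpt=28.3507` as a two-part (sequence, block) value is an assumption.

```python
import re

# Sample lines copied from the alert log excerpt above.
ALERT_LOG = """\
NOTE: crash recovery of group SOLDATA will recover thread=1 ckpt=28.3507 domain=2 inc#=2 instnum=2
NOTE: crash recovery of group SOLDATA will recover thread=2 ckpt=39.8576 domain=2 inc#=4 instnum=1
NOTE: crash recovery of group SOLDATA will recover thread=3 ckpt=21.9043 domain=2 inc#=6 instnum=4
NOTE: crash recovery of group SOLDATA will recover thread=4 ckpt=22.6878 domain=2 inc#=12 instnum=3
NOTE: BWR validation signaled ORA-15096
ORA-15096: lost disk write detected
ERROR: ORA-15096 signalled during mount of diskgroup SOLDATA
"""

RECOVERY_RE = re.compile(
    r"crash recovery of group (?P<group>\S+) will recover "
    r"thread=(?P<thread>\d+) ckpt=(?P<seq>\d+)\.(?P<blk>\d+)"
)
ORA_RE = re.compile(r"\bORA-(\d{5})\b")


def summarize(text):
    """Return ({thread: (seq, blk)}, [distinct ORA codes in order of appearance])."""
    threads = {}
    errors = []
    for line in text.splitlines():
        m = RECOVERY_RE.search(line)
        if m:
            # Assumption: the ckpt value is "sequence.block".
            threads[int(m["thread"])] = (int(m["seq"]), int(m["blk"]))
        for code in ORA_RE.findall(line):
            if code not in errors:
                errors.append(code)
    return threads, errors


if __name__ == "__main__":
    threads, errors = summarize(ALERT_LOG)
    for thread, ckpt in sorted(threads.items()):
        print(f"thread {thread}: ckpt {ckpt[0]}.{ckpt[1]}")
    print("ORA codes:", errors)
```

Thread 3 starts recovery at checkpoint 21.9043, which sits immediately before the ABA 21.9044 reported in the trace below.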

*** 2020-09-25 03:33:39.996
kfdp_query: callcnt 23 grp 2 (SOLDATA)
NOTE: group 2 (SOLDATA) high disk header ckpt advanced to fcn 0.714

*** 2020-09-25 03:33:40.201
2020-09-25 03:33:40.201684 : Start recovery for domain=2, valid=0, flags=0x4
NOTE: crash recovery of group SOLDATA will recover thread=1 ckpt=28.3507 domain=2 inc#=2 instnum=2
NOTE: crash recovery of group SOLDATA will recover thread=2 ckpt=39.8576 domain=2 inc#=4 instnum=1
NOTE: crash recovery of group SOLDATA will recover thread=3 ckpt=21.9043 domain=2 inc#=6 instnum=4
NOTE: crash recovery of group SOLDATA will recover thread=4 ckpt=22.6878 domain=2 inc#=12 instnum=3
2020-09-25 03:33:40.232217 : Validate domain 2
2020-09-25 03:33:40.235370 : kjbvalidate: bcasted validate msg for domain=2
* kjbvalidate: validated domain 2, flags = 0x0
lost disk write detected during recovery:
fn=1 blk=303 last written kfcn: 0.5092245 BWR in thd=3 ABA 21.9044
mirror side: 0
OSM metadata block dump:
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            4 ; 0x002: KFBTYP_FILEDIR
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                     303 ; 0x004: blk=303
kfbh.block.obj:                       1 ; 0x008: file=1
kfbh.check:                  3027249708 ; 0x00c: 0xb4702a2c
kfbh.fcn.base:                  5091095 ; 0x010: 0x004daf17
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfffdb.node.incarn:          1005593101 ; 0x000: A=1 NUMM=0x1df81106
kfffdb.node.frlist.number:   4294967295 ; 0x004: 0xffffffff
kfffdb.node.frlist.incarn:            0 ; 0x008: A=0 NUMM=0x0
kfffdb.hibytes:                       0 ; 0x00c: 0x00000000
kfffdb.lobytes:                   11776 ; 0x010: 0x00002e00
kfffdb.xtntcnt:                       1 ; 0x014: 0x00000001
kfffdb.xtnteof:                       1 ; 0x018: 0x00000001
kfffdb.blkSize:                     512 ; 0x01c: 0x00000200
kfffdb.flags:                        17 ; 0x020: O=1 S=0 S=0 D=0 C=1 I=0 R=0 A=0
kfffdb.fileType:                     13 ; 0x021: 0x0d

The trace makes it clear that recovery failed on the metadata block at file 1, block 303; the failing record is at thread 3, ABA 21.9044.
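The lost write is visible in the numbers themselves: the trace says the block was last written at kfcn 0.5092245, while the block header on disk still carries fcn 0.5091095. The sketch below only illustrates that comparison. It assumes the kfcn is printed as `wrap.base` (matching `kfbh.fcn.wrap` and `kfbh.fcn.base`), and `parse_lost_write` is a hypothetical helper, not part of any Oracle utility.

```python
import re

# Lines copied from the trace excerpt above.
TRACE = """\
fn=1 blk=303 last written kfcn: 0.5092245 BWR in thd=3 ABA 21.9044
kfbh.fcn.base:                  5091095 ; 0x010: 0x004daf17
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
"""

BWR_RE = re.compile(
    r"fn=(\d+) blk=(\d+) last written kfcn: (\d+)\.(\d+) "
    r"BWR in thd=(\d+) ABA (\d+)\.(\d+)"
)


def parse_lost_write(trace):
    """Compare the kfcn recorded in the BWR with the fcn in the block header."""
    bwr = BWR_RE.search(trace)
    base = re.search(r"kfbh\.fcn\.base:\s+(\d+)", trace)
    wrap = re.search(r"kfbh\.fcn\.wrap:\s+(\d+)", trace)
    # Assumption: kfcn is printed as "wrap.base"; tuples compare wrap first.
    expected = (int(bwr[3]), int(bwr[4]))
    on_disk = (int(wrap[1]), int(base[1]))
    return {
        "file": int(bwr[1]),
        "block": int(bwr[2]),
        "thread": int(bwr[5]),
        "aba": (int(bwr[6]), int(bwr[7])),
        "expected_fcn": expected,
        "on_disk_fcn": on_disk,
        # The on-disk block is older than the write the redo says happened.
        "lost_write": on_disk < expected,
    }


if __name__ == "__main__":
    info = parse_lost_write(TRACE)
    print(info)
```

With these values the on-disk fcn is behind the expected one, which is the condition ORA-15096 reports.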

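By using kfed to modify the fcn of metadata block 303, together with the related block checkpoint information, the error can be bypassed directly and the diskgroup mounted successfully.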
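Before a block can be inspected or edited, it has to be located inside file 1. The arithmetic below is only a sketch under two assumptions that the logs do not state: a 4096-byte ASM metadata block and a 1 MiB allocation unit. The function name `locate_block` is hypothetical. The alert log shows that extent 0 of file 1 is on disk 0, AU 10 (`SOLDATA.F1X0`); the physical location of any later extent would have to be read from file 1's own directory entry and is not derived here.

```python
METADATA_BLOCK_SIZE = 4096   # assumption: ASM metadata block size in bytes
AU_SIZE = 1024 * 1024        # assumption: 1 MiB allocation unit


def locate_block(block_number, block_size=METADATA_BLOCK_SIZE, au_size=AU_SIZE):
    """Return (extent index, block index within the AU, byte offset within the AU)."""
    blocks_per_au = au_size // block_size
    extent_index, block_in_au = divmod(block_number, blocks_per_au)
    return extent_index, block_in_au, block_in_au * block_size


if __name__ == "__main__":
    extent, blk, offset = locate_block(303)
    print(f"file 1 block 303: extent {extent}, block {blk} in AU, byte offset {offset}")
```

Under these assumptions block 303 falls in the second extent of file 1 (extent index 1), not in the AU 10 extent named in the alert log. Any such edit should be made only after the affected blocks and disk headers have been backed up.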

A simple case, shared here for reference.

