A Strange Case of ASM Being Unable to Rebalance

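Recently one of our customer's environments ran into an anomaly. After alter diskgroup xxx modify power 0 was executed, rebalance could not be started again, and the ARB and RBAL processes showed no reaction at all. The symptoms were roughly as follows: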

SQL> select * from v$asm_operation;

GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
           1 REBAL COMPACT   WAIT          0                                                                                                               0
           1 REBAL REBALANCE WAIT          0                                                                                                               0
           1 REBAL REBUILD   WAIT          0                                                                                                               0
           1 REBAL RESYNC    WAIT          0                                                                                                               0

 

After enabling ASM tracing, we found a few clues:

alter system set events '15195 trace name context forever,level 7';

kfdp_query: callcnt 1719757 grp 1 (DATAC1)
NOTE: GroupBlock outside rolling migration privileged region
----- Abridged Call Stack Trace -----
ksedsts()+426<-kfnmGroupBlockGlobal()+659<-kfnmGroupBlockPriv()+318<-kfgFinalize()+334<-kfxdrvAlter()+3415<-kfxdrvEntry()+1417<-opiexe()+28735<-opiosq0()+4494<-kpooprx()+387<-kpoal8()+830<-opiodr()+1202<-ttcpip()+1222<-opitsk()+1903<-opiino()+936<-opiodr()+1202
<-opidrv()+1094<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
----- End of Abridged Call Stack Trace -----
Partial short call stack signature: 0xb0ac14de6c5e2e9c
SQL> alter diskgroup DATAC1 rebalance power 6
kfgpCreate: max_fg_rel 4, max_disk_part 8
kfgpPartners: NOT appliance.
kfgpPartners: max_fg_rel, max_disk_part(4, 8) has been adjusted to (3, 8) due to actual FG, disk configuration (3, 144, num_singledisk_fg 0)
kfgpPartner: necessary rebalancing detected. Avail slot for disk120 7 target 8
WARNING: Too many uncompleted reconfigurations. Rebalance needs completion.
kfgp (0x7fbf0ce71be8), allow quorum: 0, total disks: 148, FGs: total 3  active 3  normal 3  active quorum 0, max dsknum: 147, maxfgnum: 3
scores=480 ties=0 add=2 insert=0 replace=3
disk (0x7fbf0ce71440), num 0a slot 65535 fg 1 ptotal 10 pact 7 pnew 1 pdrp 2
pset dsk 0 [10]:  a15fg3 d17fg3 d6fg2 a10fg2 a16fg3 a8fg2 a13fg3 a122fg2 a49fg2 n130fg3
disk (0x7fbf0ce709b0), num 1a slot 65535 fg 1 ptotal 10 pact 8 pnew 0 pdrp 2
pset dsk 1 [10]:  d9fg2 a11fg2 d16fg3 a10fg2 a15fg3 a14fg3 a6fg2 a125fg3 a115fg2 a138fg3
disk (0x7fbf0ce70a18), num 2a slot 65535 fg 1 ptotal 8 pact 8 pnew 0 pdrp 0
pset dsk 2 [8]:  a13fg3 a11fg2 a16fg3 a17fg3 a9fg2 a7fg2 a12fg3 a55fg2
disk (0x7fbf077108e0), num 3a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 3 [11]:  d7fg2 a17fg3 a11fg2 a9fg2 a12fg3 d8fg2 a14fg3 a127fg3 a110fg2 d48fg2 a131fg3
disk (0x7fbf07710948), num 4a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 4 [11]:  a14fg3 d10fg2 d12fg3 a6fg2 d13fg3 a7fg2 a15fg3 a114fg2 a50fg2 a34fg3 a140fg3
disk (0x7fbf077109b0), num 5a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 5 [13]:  d12fg3 d7fg2 d13fg3 d8fg2 a14fg3 a9fg2 a15fg3 a58fg2 a115fg2 a35fg3 a93fg2 a135fg3 d48fg2
disk (0x7fbf0770f908), num 6a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 6 [11]:  d14fg3 d17fg3 d0fg1 a4fg1 a13fg3 a15fg3 a1fg1 a36fg1 a32fg3 a69fg3 a85fg1
......
......
disk (0x7fbf07883478), num 85a slot 65535 fg 1 ptotal 15 pact 8 pnew 0 pdrp 7
pset dsk 85 [15]:  d132fg3 d121fg2 a140fg3 d137fg3 d113fg2 a131fg3 a103fg2 d119fg2 a124fg3 d117fg2 a6fg2 a99fg2 a110fg2 d62fg3 a142fg3
disk (0x7fbf078834e0), num 86a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 86 [13]:  d141fg3 d111fg2 d122fg2 a140fg3 a137fg3 a112fg2 d109fg2 a66fg3 a8fg2 d113fg2 a26fg2 a123fg3 a27fg2
disk (0x7fbf07882328), num 87a slot 65535 fg 2 ptotal 8 pact 8 pnew 0 pdrp 0
pset dsk 87 [8]:  a89fg1 a139fg3 a82fg1 a143fg3 a137fg3 a145fg1 a73fg1 a141fg3
disk (0x7fbf07882390), num 88a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 88 [11]:  a109fg2 d142fg3 d119fg2 a115fg2 a133fg3 d106fg2 a127fg3 a110fg2 a80fg3 a120fg2 a31fg3
disk (0x7fbf078823f8), num 89a slot 65535 fg 1 ptotal 12 pact 8 pnew 0 pdrp 4
pset dsk 89 [12]:  d128fg3 d120fg2 a139fg3 a136fg3 d107fg2 d104fg2 a126fg3 a130fg3 a51fg2 a81fg2 a142fg3 a87fg2
disk (0x7fbf07882460), num 90a slot 65535 fg 1 ptotal 9 pact 8 pnew 0 pdrp 1
pset dsk 90 [9]:  a139fg3 d107fg2 a136fg3 a105fg2 a111fg2 a124fg3 a137fg3 a54fg2 a53fg2
disk (0x7fbf078824c8), num 91a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 91 [13]:  d117fg2 d121fg2 d139fg3 d115fg2 a132fg3 a102fg2 d123fg3 a104fg2 a143fg3 a27fg2 a57fg2 a62fg3 a33fg3
disk (0x7fbf078812a8), num 92a slot 65535 fg 1 ptotal 13 pact 7 pnew 1 pdrp 5
pset dsk 92 [13]:  d105fg2 d121fg2 a139fg3 d114fg2 d128fg3 a129fg3 d109fg2 a134fg3 a8fg2 a24fg2 a61fg3 a9fg2 n67fg3
disk (0x7fbf07881310), num 93a slot 65535 fg 2 ptotal 8 pact 8 pnew 0 pdrp 0
pset dsk 93 [8]:  a34fg3 a145fg1 a5fg1 a96fg1 a71fg3 a133fg3 a129fg3 a98fg1
disk (0x7fbf07881378), num 94a slot 65535 fg 1 ptotal 15 pact 8 pnew 0 pdrp 7
pset dsk 94 [15]:  d103fg2 d142fg3 d117fg2 d108fg2 a130fg3 a110fg2 d131fg3 a24fg2 a69fg3 a105fg2 a28fg2 a33fg3 d71fg3 a132fg3 d138fg3
disk (0x7fbf078813e0), num 95a slot 65535 fg 1 ptotal 10 pact 8 pnew 0 pdrp 2
pset dsk 95 [10]:  a135fg3 d119fg2 a102fg2 d126fg3 a106fg2 a127fg3 a35fg3 a118fg2 a64fg3 a114fg2
disk (0x7fbf07880228), num 96a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 96 [13]:  d133fg3 d140fg3 d102fg2 d111fg2 a123fg3 a103fg2 a124fg3 a136fg3 a93fg2 a120fg2 d122fg2 a65fg3 a69fg3
disk (0x7fbf07880290), num 97a slot 65535 fg 1 ptotal 18 pact 8 pnew 0 pdrp 10
pset dsk 97 [18]:  d110fg2 d120fg2 d132fg3 d112fg2 d133fg3 d134fg3 d114fg2 d116fg2 a137fg3 a75fg2 a127fg3 a108fg2 d76fg2 a29fg2 d64fg3 a117fg2 a15fg3 a12fg3
disk (0x7fbf0787f1a8), num 98a slot 65535 fg 1 ptotal 18 pact 8 pnew 0 pdrp 10
pset dsk 98 [18]:  d129fg3 d120fg2 d123fg3 d106fg2 a127fg3 d107fg2 a135fg3 d116fg2 d16fg3 a24fg2 a128fg3 a93fg2 a8fg2 a7fg2 d33fg3 d115fg2 d17fg3 a34fg3
disk (0x7fbf0787f210), num 99a slot 65535 fg 2 ptotal 8 pact 8 pnew 0 pdrp 0
pset dsk 99 [8]:  a46fg1 a142fg3 a40fg1 a128fg3 a84fg1 a143fg3 a85fg1 a140fg3
disk (0x7fbf0787f278), num 100a slot 65535 fg 1 ptotal 15 pact 8 pnew 0 pdrp 7
pset dsk 100 [15]:  a125fg3 a108fg2 a129fg3 d109fg2 d130fg3 d131fg3 a133fg3 d102fg2 a81fg2 a7fg2 d122fg2 a16fg3 a76fg2 d116fg2 d61fg3
disk (0x7fbf0787f2e0), num 101a slot 65535 fg 1 ptotal 9 pact 8 pnew 0 pdrp 1
pset dsk 101 [9]:  a124fg3 a104fg2 a125fg3 a126fg3 a28fg2 a68fg3 a115fg2 d107fg2 a117fg2
disk (0x7fbf0787db08), num 102a slot 65535 fg 2 ptotal 14 pact 8 pnew 0 pdrp 6
pset dsk 102 [14]:  d135fg3 d96fg1 a95fg1 d134fg3 a91fg1 a127fg3 a80fg3 d100fg1 a15fg3 d61fg3 a83fg1 a44fg1 a140fg3 d142fg3
disk (0x7fbf0787db70), num 103a slot 65535 fg 2 ptotal 12 pact 8 pnew 0 pdrp 4
pset dsk 103 [12]:  d143fg3 d94fg1 a96fg1 d135fg3 a132fg3 a85fg1 d136fg3 a144fg1 a82fg1 a67fg3 a42fg1 a35fg3
disk (0x7fbf0787dbd8), num 104a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 104 [11]:  d133fg3 a101fg1 d140fg3 d89fg1 a126fg3 a91fg1 a144fg1 a143fg3 a31fg3 a68fg3 a142fg3
disk (0x7fbf0787ca88), num 105a slot 65535 fg 2 ptotal 10 pact 8 pnew 0 pdrp 2
pset dsk 105 [10]:  d92fg1 d137fg3 a90fg1 a131fg3 a83fg1 a18fg1 a67fg3 a128fg3 a94fg1 a30fg3
disk (0x7fbf0787caf0), num 106a slot 65535 fg 2 ptotal 13 pact 7 pnew 1 pdrp 5
pset dsk 106 [13]:  d143fg3 d98fg1 a95fg1 a134fg3 d88fg1 a124fg3 a45fg1 a70fg3 d64fg3 a35fg3 d146fg1 a132fg3 n47fg1
disk (0x7fbf0787cb58), num 107i slot 65535 fg 2 ptotal 8 pact 0 pnew 0 pdrp 8
pset dsk 107 [8]:  d141fg3 d90fg1 d143fg3 d98fg1 d137fg3 d89fg1 d130fg3 d101fg1
disk (0x7fbf0787cbc0), num 108a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 108 [13]:  a100fg1 d140fg3 d94fg1 a131fg3 a123fg3 d83fg1 a144fg1 d142fg3 a33fg3 a36fg1 a97fg1 a125fg3 d71fg3
disk (0x7fbf0787ba08), num 109a slot 65535 fg 2 ptotal 14 pact 8 pnew 0 pdrp 6
pset dsk 109 [14]:  a88fg1 d100fg1 d137fg3 d92fg1 d86fg1 d129fg3 d79fg3 a20fg1 a43fg1 a130fg3 a21fg1 a66fg3 a13fg3 a147fg1
disk (0x7fbf0787ba70), num 110a slot 65535 fg 2 ptotal 12 pact 8 pnew 0 pdrp 4
pset dsk 110 [12]:  d97fg1 d127fg3 d139fg3 a94fg1 a133fg3 a137fg3 d147fg1 a88fg1 a131fg3 a3fg1 a85fg1 a130fg3
disk (0x7fbf0787bad8), num 111a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 111 [11]:  d86fg1 d142fg3 d96fg1 a136fg3 a90fg1 a82fg1 a125fg3 a71fg3 a41fg1 a16fg3 a33fg3
disk (0x7fbf0787bb40), num 112a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 112 [11]:  d142fg3 d97fg1 a123fg3 a86fg1 d130fg3 a40fg1 a22fg1 a129fg3 a134fg3 a43fg1 a70fg3
disk (0x7fbf0787a988), num 113i slot 65535 fg 2 ptotal 9 pact 0 pnew 0 pdrp 9
pset dsk 113 [9]:  d138fg3 d84fg1 d139fg3 d135fg3 d82fg1 d85fg1 d127fg3 d23fg1 d86fg1
disk (0x7fbf0787a9f0), num 114a slot 65535 fg 2 ptotal 15 pact 7 pnew 1 pdrp 7
pset dsk 114 [15]:  d123fg3 d142fg3 d97fg1 d92fg1 a124fg3 a83fg1 a126fg3 a4fg1 d71fg3 d41fg1 d67fg3 a61fg3 a95fg1 a64fg3 n69fg3
disk (0x7fbf0787aa58), num 115a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 115 [13]:  d82fg1 a136fg3 a88fg1 d132fg3 d91fg1 a133fg3 a101fg1 d141fg3 a5fg1 a1fg1 a61fg3 d98fg1 a137fg3
disk (0x7fbf0787aac0), num 116a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 116 [13]:  d136fg3 d138fg3 a84fg1 a128fg3 d98fg1 a141fg3 d97fg1 a145fg1 a135fg3 a130fg3 a21fg1 d100fg1 a129fg3
disk (0x7fbf07879908), num 117a slot 65535 fg 2 ptotal 14 pact 8 pnew 0 pdrp 6
pset dsk 117 [14]:  d91fg1 d134fg3 d135fg3 d94fg1 d125fg3 a74fg1 a73fg1 a140fg3 a101fg1 a60fg3 d85fg1 a35fg3 a17fg3 a97fg1
disk (0x7fbf07879970), num 118a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5
pset dsk 118 [13]:  d130fg3 d131fg3 d140fg3 a44fg1 d72fg1 a38fg1 a139fg3 a95fg1 a65fg3 d143fg3 a84fg1 a61fg3 a34fg3
disk (0x7fbf078799d8), num 119i slot 65535 fg 2 ptotal 8 pact 0 pnew 0 pdrp 8
pset dsk 119 [8]:  d130fg3 d95fg1 d125fg3 d84fg1 d126fg3 d129fg3 d88fg1 d85fg1
disk (0x7fbf07879a40), num 120a slot 65535 fg 2 ptotal 20 pact 7 pnew 0 pdrp 13
pset dsk 120 [20]:  d128fg3 d89fg1 d138fg3 d97fg1 d98fg1 d124fg3 d142fg3 d145fg1 d35fg3 d45fg1 a88fg1 d33fg3 d40fg1 a96fg1 a46fg1 a20fg1 a12fg3 a147fg1 d141fg3 a78fg3
disk (0x7fbf07879aa8), num 121a slot 65535 fg 2 ptotal 16 pact 8 pnew 0 pdrp 8
pset dsk 121 [16]:  a126fg3 d85fg1 a132fg3 d91fg1 d133fg3 d92fg1 a146fg1 d16fg3 a47fg1 a141fg3 a138fg3 d12fg3 a82fg1 d142fg3 d136fg3 a69fg3
disk (0x7fbf07878888), num 122a slot 65535 fg 2 ptotal 15 pact 8 pnew 0 pdrp 7
pset dsk 122 [15]:  d124fg3 a123fg3 d82fg1 d83fg1 d127fg3 d86fg1 a128fg3 a0fg1 a78fg3 a70fg3 a84fg1 d100fg1 d96fg1 a42fg1 a39fg1
disk (0x7fbf078788f0), num 123a slot 65535 fg 3 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 123 [11]:  d114fg2 a122fg2 d98fg1 a96fg1 a112fg2 d91fg1 a108fg2 a25fg2 a146fg1 a50fg2 a86fg1
disk (0x7fbf07877808), num 124a slot 65535 fg 3 ptotal 10 pact 8 pnew 0 pdrp 2
......
......
disk (0x7fbf07872658), num 145a slot 65535 fg 1 ptotal 13 pact 7 pnew 1 pdrp 5
pset dsk 145 [13]:  a34fg3 d49fg2 d17fg3 d138fg3 d120fg2 a136fg3 a116fg2 a137fg3 d134fg3 a87fg2 a93fg2 a56fg2 n143fg3
disk (0x7fbf07871508), num 146a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3
pset dsk 146 [11]:  a79fg3 d77fg2 d35fg3 a52fg2 a55fg2 a75fg2 a131fg3 a121fg2 a123fg3 d106fg2 a32fg3
disk (0x7fbf07871570), num 147a slot 65535 fg 1 ptotal 14 pact 8 pnew 0 pdrp 6
pset dsk 147 [14]:  d57fg2 d69fg3 d34fg3 a128fg3 d110fg2 d78fg3 d76fg2 a138fg3 a55fg2 a31fg3 a120fg2 a16fg3 a65fg3 a109fg2
fail (0x7fbf0ce71398), name MPC2C1 num 1 size 48 act 48 new 0 drp 0 au 20889600
ptotal 566 pact 380 pnew 4 pdrp 182 rtotal 2 ract 2 rnew 0 rdrp 0
fset (0x7fbf0ce716a8), fg: 1, tot: 2
frel (0x7fbf077102c0), fg:<1 2>, totaldp:294 actdp 191 newdp 1 drpdp 102, st A
frel (0x7fbf076faf08), fg:<1 3>, totaldp:272 actdp 189 newdp 3 drpdp 80, st A
disks:     0    1    2    3    4    5   18   19   20   21   22   23   36   37   38   39   40   41   42   43   44   45   46   47   72   73   74   82   83   84   85   86   88   89   90   91   92   94   95   96   97   98  100  101  144  145  146  147
fail (0x7fbf0770f860), name MPC2C2 num 2 size 52 act 48 new 0 drp 4 au 20889600
ptotal 612 pact 381 pnew 2 pdrp 229 rtotal 2 ract 2 rnew 0 rdrp 0
fset (0x7fbf07710570), fg: 2, tot: 2
frel (0x7fbf077102c0), fg:<1 2>, totaldp:294 actdp 191 newdp 1 drpdp 102, st A
frel (0x7fbf076fae48), fg:<2 3>, totaldp:318 actdp 190 newdp 1 drpdp 127, st A
disks:     6    7    8    9   10   11   24   25   26   27   28   29   48   49   50   51   52   53   54   55   56   57   58   59   75   76   77   81   87   93   99  102  103  104  105  106  107  108  109  110  111  112  113  114  115  116  117  118  119
120  121  122
fail (0x7fbf076fa788), name MPC2C3 num 3 size 48 act 48 new 0 drp 0 au 20889600
ptotal 590 pact 379 pnew 4 pdrp 207 rtotal 2 ract 2 rnew 0 rdrp 0
fset (0x7fbf076fb158), fg: 3, tot: 2
frel (0x7fbf076faf08), fg:<1 3>, totaldp:272 actdp 189 newdp 3 drpdp 80, st A
frel (0x7fbf076fae48), fg:<2 3>, totaldp:318 actdp 190 newdp 1 drpdp 127, st A
disks:    12   13   14   15   16   17   30   31   32   33   34   35   60   61   62   63   64   65   66   67   68   69   70   71   78   79   80  123  124  125  126  127  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143
cset (0x7fbf0ce71358), total frels: 3
frel (0x7fbf077102c0), fg:<1 2>, totaldp:294 actdp 191 newdp 1 drpdp 102, st A
frel (0x7fbf076faf08), fg:<1 3>, totaldp:272 actdp 189 newdp 3 drpdp 80, st A
frel (0x7fbf076fae48), fg:<2 3>, totaldp:318 actdp 190 newdp 1 drpdp 127, st A
kfdp_query: callcnt 1721983 grp 1 (DATAC1)
NOTE: GroupBlock outside rolling migration privileged region
----- Abridged Call Stack Trace -----
ksedsts()+426<-kfnmGroupBlockGlobal()+659<-kfnmGroupBlockPriv()+318<-kfgFinalize()+334<-kfxdrvAlter()+3415<-kfxdrvEntry()+1417<-opiexe()+28735<-opiosq0()+4494<-kpooprx()+387<-kpoal8()+830<-opiodr()+1202<-ttcpip()+1222<-opitsk()+1903<-opiino()+936<-opiodr()+1202
<-opidrv()+1094<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
----- End of Abridged Call Stack Trace -----
Partial short call stack signature: 0xb0ac14de6c5e2e9c
SQL>  alter diskgroup DATAC1 rebalance power 6
kfgpCreate: max_fg_rel 4, max_disk_part 8
kfgpPartners: NOT appliance.
kfgpPartners: max_fg_rel, max_disk_part(4, 8) has been adjusted to (3, 8) due to actual FG, disk configuration (3, 144, num_singledisk_fg 0)
kfgpPartners: verifying consistency of newly formed  partners.
kfgpPartners: repartnering completed.
kfgpGet: insufficient space provided by caller. size 21, pcnt 20, KFPTNR_MAXTOT 20
WARNING: Too many uncompleted reconfigurations. Rebalance needs completion.
kfgp (0x7fb69d5f2910), allow quorum: 0, total disks: 148, FGs: total 3  active 3  normal 3  active quorum 0, max dsknum: 147, maxfgnum: 3
scores=55296 ties=9696 add=576 insert=0 replace=0
disk (0x7fb69d5f1c90), num 0a slot 65535 fg 1 ptotal 8 pact 0 pnew 8 pdrp 0

 

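Judging from the first trace, Oracle ASM was reporting a problem with the PST partner information of the disks involved, so we used event 15195 at level 0x39 to rebuild the PST partner relationships. That did not solve the problem. Instead, the next attempt reported: kfgpGet: insufficient space provided by caller. size 21, pcnt 20, KFPTNR_MAXTOT 20.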
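To see which disks sit at the partner limit, the per-disk summary lines of the trace can be scanned mechanically. The following is a minimal sketch, not an Oracle tool: the function name `disks_at_partner_limit` is hypothetical, the limit of 20 is simply taken from the `KFPTNR_MAXTOT 20` text in the trace, and the regular expression assumes the line layout shown above.

```python
import re

# Assumption: 20 comes from the "KFPTNR_MAXTOT 20" text in the trace above.
KFPTNR_MAXTOT = 20

# Matches the per-disk summary lines of the event 15195 trace.
DISK_LINE = re.compile(
    r"disk \(0x[0-9a-f]+\), num (?P<num>\d+)(?P<state>[a-z]) slot \d+ "
    r"fg (?P<fg>\d+) ptotal (?P<ptotal>\d+) pact (?P<pact>\d+) "
    r"pnew (?P<pnew>\d+) pdrp (?P<pdrp>\d+)"
)


def disks_at_partner_limit(trace_text, limit=KFPTNR_MAXTOT):
    """Return (disk, ptotal, pact, pdrp) for disks whose ptotal >= limit."""
    hits = []
    for m in DISK_LINE.finditer(trace_text):
        ptotal = int(m["ptotal"])
        if ptotal >= limit:
            hits.append((int(m["num"]), ptotal, int(m["pact"]), int(m["pdrp"])))
    return hits


# Three summary lines copied from the trace above.
SAMPLE = """\
disk (0x7fbf078799d8), num 119i slot 65535 fg 2 ptotal 8 pact 0 pnew 0 pdrp 8
disk (0x7fbf07879a40), num 120a slot 65535 fg 2 ptotal 20 pact 7 pnew 0 pdrp 13
disk (0x7fbf07879aa8), num 121a slot 65535 fg 2 ptotal 16 pact 8 pnew 0 pdrp 8
"""

if __name__ == "__main__":
    for num, ptotal, pact, pdrp in disks_at_partner_limit(SAMPLE):
        print(f"disk {num}: ptotal={ptotal} pact={pact} pdrp={pdrp}")
```

On these three lines only disk 120 is flagged: its partner list is full (20 entries), yet only 7 partners are active against the target of 8 reported by kfgpPartner, and the other 13 are still pending drop.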

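To investigate, I tried to reproduce the problem in our internal test environment by repeatedly offlining and dropping disks and then adding them back, changing the rebalance power several times after these disk operations. After at least 10 rounds of testing, I eventually hit an error I did not recognize: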

SQL> alter diskgroup dg_data01 drop disk DG_DATA01_0135 force
2021-07-16T17:04:25.492886+08:00
NOTE: cache closing disk 139 of grp 1: (not open) _DROPPED_0139_DG_DATA01
NOTE: GroupBlock outside rolling migration privileged region
NOTE: full repartnering enabled for group 1 by test event 15195 level 0x39
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_160232.trc  (incident=40010):
ORA-00600: internal error code, arguments: [kfgCanRepartner01], [2], [3], [6], [], [], [], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_40010/+ASM1_ora_160232_i40010.trc
2021-07-16T17:04:26.506682+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-07-16T17:04:26.506873+08:00
ORA-00600: internal error code, arguments: [kfgCanRepartner01], [2], [3], [6], [], [], [], [], [], [], [], []
2021-07-16T17:04:26.506954+08:00
ERROR: alter diskgroup dg_data01 drop disk DG_DATA01_0135 force
2021-07-16T17:04:26.509210+08:00
SQL> alter diskgroup dg_data01 drop disk DG_DATA01_0134 force
2021-07-16T17:04:26.509917+08:00
NOTE: cache closing disk 139 of grp 1: (not open) _DROPPED_0139_DG_DATA01
NOTE: GroupBlock outside rolling migration privileged region
NOTE: full repartnering enabled for group 1 by test event 15195 level 0x39
2021-07-16T17:04:26.584378+08:00
Dumping diagnostic data in directory=[cdmp_20210716170426], requested by (instance=1, osid=160232), summary=[incident=40010].
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_160232.trc  (incident=40011):
ORA-00600: internal error code, arguments: [kfgCanRepartner01], [3], [1], [9], [], [], [], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_40011/+ASM1_ora_160232_i40011.trc

 

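I had never seen this error before. It suggests that ASM management in Oracle 19c still has some shortcomings: frequently dropping and adding disks before a rebalance has completed can trigger problems. That said, judging from these tests, the ASM checking mechanisms in 19c are more complete than in 11.2.0.4, and ASM is somewhat more robust.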

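Back to the case at hand. Just as we were running out of ideas, one night the disks of one storage node in the customer's environment were taken offline. After they were brought back online, the disk group rebalance surprisingly started working normally again. I therefore traced and analyzed this further. The disks involved in this offline event are shown below: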

*** 2021-07-23T05:41:27.626857+08:00
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 82/0x0, mask = 0x7f, op = assign mandatory
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 88/0x0, mask = 0x7f, op = assign mandatory
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 94/0x0, mask = 0x7f, op = assign mandatory
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 100/0x0, mask = 0x7f, op = assign mandatory
kfdp_updateDsk(): callcnt 1766027 grp 1
PST verChk -0: req, id=3197182789, grp=1, requested=146 at 07/23/2021 05:41:27
NOTE: PST update grp = 1 completed successfully
NOTE: kfdsFilter_freeDskSrSlice for Filter 0x7ff72009cfd0
NOTE: kfdsFilter_clearDskSlice for Filter 0x7ff72009cfd0 (all:TRUE)
NOTE: completed online of disk group 1 disks
DATAC1_0082 (82)
DATAC1_0088 (88)
DATAC1_0094 (94)
DATAC1_0100 (100)
ARB0 relocating file +DATAC1.1.1 reason 6 (1 entries first xnum 0x1)
ARB0 relocating file +DATAC1.3.1 reason 6 (9 entries first xnum 0x3)

 

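Four disks were involved: 82, 88, 94 and 100. From the earlier trace we know that the rebalance was mainly stuck on disk 120, and that Oracle reported this disk's PST partner slots had reached the maximum. Analysis with kfed confirmed that the maximum for this structure is indeed 20.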

So why did the rebalance of the whole diskgroup recover after these four disks happened to be taken offline and brought back online?

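Our analysis eventually showed that one of the four offlined disks, disk 88, happens to be a partner of disk 120. We believe that the offline operation ultimately caused Oracle to skip the consistency check for disk 120.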
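The partner relationship can be checked directly against the `pset dsk 120` line from the first trace. The sketch below is only an illustration: `parse_pset` is a hypothetical helper, and reading the prefix letters as a = active, n = new, d = dropping is an inference from how the counts match `pact`, `pnew` and `pdrp` in the trace, not documented behavior.

```python
import re

# pset line for disk 120, copied from the first trace above.
PSET_120 = (
    "pset dsk 120 [20]:  d128fg3 d89fg1 d138fg3 d97fg1 d98fg1 d124fg3 "
    "d142fg3 d145fg1 d35fg3 d45fg1 a88fg1 d33fg3 d40fg1 a96fg1 a46fg1 "
    "a20fg1 a12fg3 a147fg1 d141fg3 a78fg3"
)

# One entry looks like "a88fg1": state letter, disk number, failgroup number.
ENTRY = re.compile(r"\b([adn])(\d+)fg(\d+)\b")


def parse_pset(line):
    """Return {partner disk number: state letter} for one pset line."""
    _, _, entries = line.partition(":")
    return {int(num): state for state, num, _fg in ENTRY.findall(entries)}


partners = parse_pset(PSET_120)

# Disks that were offlined and then onlined on 2021-07-23.
onlined = {82, 88, 94, 100}
shared = {d: partners[d] for d in onlined if d in partners}

if __name__ == "__main__":
    print(shared)
```

Of the four disks, only 88 appears in disk 120's partner list, and it is one of the 7 entries marked `a`.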

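Seen in this light, the solutions we had previously proposed to the customer are consistent with this finding:

1. Offline disk 120, then online it again (as long as this happens within the disk repair time, the offline/online cycle does not trigger a rebalance).

2. Drop disk 120 with force, then run the rebalance manually.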

 

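This case is a fairly interesting one, so I am recording it briefly here. What makes it special is that the diskgroup is quite large, around 250 TB, so every operation had to be carried out with great care.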

