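Recently one of our customer environments ran into a problem: after running alter diskgroup xxx modify power 0, rebalance could not be started again. The ARB and RBAL processes did not react at all. The symptom looked roughly like this: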
SQL> select * from v$asm_operation;

GROUP_NUMBER OPERA PASS STAT POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
1 REBAL COMPACT WAIT 0 0
1 REBAL REBALANCE WAIT 0 0
1 REBAL REBUILD WAIT 0 0
1 REBAL RESYNC WAIT 0 0
After enabling ASM tracing, we found some clues:
alter system set events '15195 trace name context forever,level 7';
kfdp_query: callcnt 1719757 grp 1 (DATAC1)
NOTE: GroupBlock outside rolling migration privileged region
----- Abridged Call Stack Trace -----
ksedsts()+426<-kfnmGroupBlockGlobal()+659<-kfnmGroupBlockPriv()+318<-kfgFinalize()+334<-kfxdrvAlter()+3415<-kfxdrvEntry()+1417<-opiexe()+28735<-opiosq0()+4494<-kpooprx()+387<-kpoal8()+830<-opiodr()+1202<-ttcpip()+1222<-opitsk()+1903<-opiino()+936<-opiodr()+1202
<-opidrv()+1094<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
----- End of Abridged Call Stack Trace -----
Partial short call stack signature: 0xb0ac14de6c5e2e9c
SQL> alter diskgroup DATAC1 rebalance power 6
kfgpCreate: max_fg_rel 4, max_disk_part 8
kfgpPartners: NOT appliance.
kfgpPartners: max_fg_rel, max_disk_part(4, 8) has been adjusted to (3, 8) due to actual FG, disk configuration (3, 144, num_singledisk_fg 0)
kfgpPartner: necessary rebalancing detected. Avail slot for disk120 7 target 8
WARNING: Too many uncompleted reconfigurations. Rebalance needs completion.
kfgp (0x7fbf0ce71be8), allow quorum: 0, total disks: 148, FGs: total 3 active 3 normal 3 active quorum 0, max dsknum: 147, maxfgnum: 3 scores=480 ties=0 add=2 insert=0 replace=3 disk (0x7fbf0ce71440), num 0a slot 65535 fg 1 ptotal 10 pact 7 pnew 1 pdrp 2 pset dsk 0 [10]: a15fg3 d17fg3 d6fg2 a10fg2 a16fg3 a8fg2 a13fg3 a122fg2 a49fg2 n130fg3 disk (0x7fbf0ce709b0), num 1a slot 65535 fg 1 ptotal 10 pact 8 pnew 0 pdrp 2 pset dsk 1 [10]: d9fg2 a11fg2 d16fg3 a10fg2 a15fg3 a14fg3 a6fg2 a125fg3 a115fg2 a138fg3 disk (0x7fbf0ce70a18), num 2a slot 65535 fg 1 ptotal 8 pact 8 pnew 0 pdrp 0 pset dsk 2 [8]: a13fg3 a11fg2 a16fg3 a17fg3 a9fg2 a7fg2 a12fg3 a55fg2 disk (0x7fbf077108e0), num 3a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 3 [11]: d7fg2 a17fg3 a11fg2 a9fg2 a12fg3 d8fg2 a14fg3 a127fg3 a110fg2 d48fg2 a131fg3 disk (0x7fbf07710948), num 4a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 4 [11]: a14fg3 d10fg2 d12fg3 a6fg2 d13fg3 a7fg2 a15fg3 a114fg2 a50fg2 a34fg3 a140fg3 disk (0x7fbf077109b0), num 5a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 5 [13]: d12fg3 d7fg2 d13fg3 d8fg2 a14fg3 a9fg2 a15fg3 a58fg2 a115fg2 a35fg3 a93fg2 a135fg3 d48fg2 disk (0x7fbf0770f908), num 6a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 6 [11]: d14fg3 d17fg3 d0fg1 a4fg1 a13fg3 a15fg3 a1fg1 a36fg1 a32fg3 a69fg3 a85fg1 ...... ...... 
disk (0x7fbf07883478), num 85a slot 65535 fg 1 ptotal 15 pact 8 pnew 0 pdrp 7 pset dsk 85 [15]: d132fg3 d121fg2 a140fg3 d137fg3 d113fg2 a131fg3 a103fg2 d119fg2 a124fg3 d117fg2 a6fg2 a99fg2 a110fg2 d62fg3 a142fg3 disk (0x7fbf078834e0), num 86a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 86 [13]: d141fg3 d111fg2 d122fg2 a140fg3 a137fg3 a112fg2 d109fg2 a66fg3 a8fg2 d113fg2 a26fg2 a123fg3 a27fg2 disk (0x7fbf07882328), num 87a slot 65535 fg 2 ptotal 8 pact 8 pnew 0 pdrp 0 pset dsk 87 [8]: a89fg1 a139fg3 a82fg1 a143fg3 a137fg3 a145fg1 a73fg1 a141fg3 disk (0x7fbf07882390), num 88a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 88 [11]: a109fg2 d142fg3 d119fg2 a115fg2 a133fg3 d106fg2 a127fg3 a110fg2 a80fg3 a120fg2 a31fg3 disk (0x7fbf078823f8), num 89a slot 65535 fg 1 ptotal 12 pact 8 pnew 0 pdrp 4 pset dsk 89 [12]: d128fg3 d120fg2 a139fg3 a136fg3 d107fg2 d104fg2 a126fg3 a130fg3 a51fg2 a81fg2 a142fg3 a87fg2 disk (0x7fbf07882460), num 90a slot 65535 fg 1 ptotal 9 pact 8 pnew 0 pdrp 1 pset dsk 90 [9]: a139fg3 d107fg2 a136fg3 a105fg2 a111fg2 a124fg3 a137fg3 a54fg2 a53fg2 disk (0x7fbf078824c8), num 91a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 91 [13]: d117fg2 d121fg2 d139fg3 d115fg2 a132fg3 a102fg2 d123fg3 a104fg2 a143fg3 a27fg2 a57fg2 a62fg3 a33fg3 disk (0x7fbf078812a8), num 92a slot 65535 fg 1 ptotal 13 pact 7 pnew 1 pdrp 5 pset dsk 92 [13]: d105fg2 d121fg2 a139fg3 d114fg2 d128fg3 a129fg3 d109fg2 a134fg3 a8fg2 a24fg2 a61fg3 a9fg2 n67fg3 disk (0x7fbf07881310), num 93a slot 65535 fg 2 ptotal 8 pact 8 pnew 0 pdrp 0 pset dsk 93 [8]: a34fg3 a145fg1 a5fg1 a96fg1 a71fg3 a133fg3 a129fg3 a98fg1 disk (0x7fbf07881378), num 94a slot 65535 fg 1 ptotal 15 pact 8 pnew 0 pdrp 7 pset dsk 94 [15]: d103fg2 d142fg3 d117fg2 d108fg2 a130fg3 a110fg2 d131fg3 a24fg2 a69fg3 a105fg2 a28fg2 a33fg3 d71fg3 a132fg3 d138fg3 disk (0x7fbf078813e0), num 95a slot 65535 fg 1 ptotal 10 pact 8 pnew 0 pdrp 2 pset dsk 95 [10]: a135fg3 d119fg2 a102fg2 d126fg3 a106fg2 a127fg3 a35fg3 
a118fg2 a64fg3 a114fg2 disk (0x7fbf07880228), num 96a slot 65535 fg 1 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 96 [13]: d133fg3 d140fg3 d102fg2 d111fg2 a123fg3 a103fg2 a124fg3 a136fg3 a93fg2 a120fg2 d122fg2 a65fg3 a69fg3 disk (0x7fbf07880290), num 97a slot 65535 fg 1 ptotal 18 pact 8 pnew 0 pdrp 10 pset dsk 97 [18]: d110fg2 d120fg2 d132fg3 d112fg2 d133fg3 d134fg3 d114fg2 d116fg2 a137fg3 a75fg2 a127fg3 a108fg2 d76fg2 a29fg2 d64fg3 a117fg2 a15fg3 a12fg3 disk (0x7fbf0787f1a8), num 98a slot 65535 fg 1 ptotal 18 pact 8 pnew 0 pdrp 10 pset dsk 98 [18]: d129fg3 d120fg2 d123fg3 d106fg2 a127fg3 d107fg2 a135fg3 d116fg2 d16fg3 a24fg2 a128fg3 a93fg2 a8fg2 a7fg2 d33fg3 d115fg2 d17fg3 a34fg3 disk (0x7fbf0787f210), num 99a slot 65535 fg 2 ptotal 8 pact 8 pnew 0 pdrp 0 pset dsk 99 [8]: a46fg1 a142fg3 a40fg1 a128fg3 a84fg1 a143fg3 a85fg1 a140fg3 disk (0x7fbf0787f278), num 100a slot 65535 fg 1 ptotal 15 pact 8 pnew 0 pdrp 7 pset dsk 100 [15]: a125fg3 a108fg2 a129fg3 d109fg2 d130fg3 d131fg3 a133fg3 d102fg2 a81fg2 a7fg2 d122fg2 a16fg3 a76fg2 d116fg2 d61fg3 disk (0x7fbf0787f2e0), num 101a slot 65535 fg 1 ptotal 9 pact 8 pnew 0 pdrp 1 pset dsk 101 [9]: a124fg3 a104fg2 a125fg3 a126fg3 a28fg2 a68fg3 a115fg2 d107fg2 a117fg2 disk (0x7fbf0787db08), num 102a slot 65535 fg 2 ptotal 14 pact 8 pnew 0 pdrp 6 pset dsk 102 [14]: d135fg3 d96fg1 a95fg1 d134fg3 a91fg1 a127fg3 a80fg3 d100fg1 a15fg3 d61fg3 a83fg1 a44fg1 a140fg3 d142fg3 disk (0x7fbf0787db70), num 103a slot 65535 fg 2 ptotal 12 pact 8 pnew 0 pdrp 4 pset dsk 103 [12]: d143fg3 d94fg1 a96fg1 d135fg3 a132fg3 a85fg1 d136fg3 a144fg1 a82fg1 a67fg3 a42fg1 a35fg3 disk (0x7fbf0787dbd8), num 104a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 104 [11]: d133fg3 a101fg1 d140fg3 d89fg1 a126fg3 a91fg1 a144fg1 a143fg3 a31fg3 a68fg3 a142fg3 disk (0x7fbf0787ca88), num 105a slot 65535 fg 2 ptotal 10 pact 8 pnew 0 pdrp 2 pset dsk 105 [10]: d92fg1 d137fg3 a90fg1 a131fg3 a83fg1 a18fg1 a67fg3 a128fg3 a94fg1 a30fg3 disk (0x7fbf0787caf0), num 106a slot 
65535 fg 2 ptotal 13 pact 7 pnew 1 pdrp 5 pset dsk 106 [13]: d143fg3 d98fg1 a95fg1 a134fg3 d88fg1 a124fg3 a45fg1 a70fg3 d64fg3 a35fg3 d146fg1 a132fg3 n47fg1 disk (0x7fbf0787cb58), num 107i slot 65535 fg 2 ptotal 8 pact 0 pnew 0 pdrp 8 pset dsk 107 [8]: d141fg3 d90fg1 d143fg3 d98fg1 d137fg3 d89fg1 d130fg3 d101fg1 disk (0x7fbf0787cbc0), num 108a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 108 [13]: a100fg1 d140fg3 d94fg1 a131fg3 a123fg3 d83fg1 a144fg1 d142fg3 a33fg3 a36fg1 a97fg1 a125fg3 d71fg3 disk (0x7fbf0787ba08), num 109a slot 65535 fg 2 ptotal 14 pact 8 pnew 0 pdrp 6 pset dsk 109 [14]: a88fg1 d100fg1 d137fg3 d92fg1 d86fg1 d129fg3 d79fg3 a20fg1 a43fg1 a130fg3 a21fg1 a66fg3 a13fg3 a147fg1 disk (0x7fbf0787ba70), num 110a slot 65535 fg 2 ptotal 12 pact 8 pnew 0 pdrp 4 pset dsk 110 [12]: d97fg1 d127fg3 d139fg3 a94fg1 a133fg3 a137fg3 d147fg1 a88fg1 a131fg3 a3fg1 a85fg1 a130fg3 disk (0x7fbf0787bad8), num 111a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 111 [11]: d86fg1 d142fg3 d96fg1 a136fg3 a90fg1 a82fg1 a125fg3 a71fg3 a41fg1 a16fg3 a33fg3 disk (0x7fbf0787bb40), num 112a slot 65535 fg 2 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 112 [11]: d142fg3 d97fg1 a123fg3 a86fg1 d130fg3 a40fg1 a22fg1 a129fg3 a134fg3 a43fg1 a70fg3 disk (0x7fbf0787a988), num 113i slot 65535 fg 2 ptotal 9 pact 0 pnew 0 pdrp 9 pset dsk 113 [9]: d138fg3 d84fg1 d139fg3 d135fg3 d82fg1 d85fg1 d127fg3 d23fg1 d86fg1 disk (0x7fbf0787a9f0), num 114a slot 65535 fg 2 ptotal 15 pact 7 pnew 1 pdrp 7 pset dsk 114 [15]: d123fg3 d142fg3 d97fg1 d92fg1 a124fg3 a83fg1 a126fg3 a4fg1 d71fg3 d41fg1 d67fg3 a61fg3 a95fg1 a64fg3 n69fg3 disk (0x7fbf0787aa58), num 115a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 115 [13]: d82fg1 a136fg3 a88fg1 d132fg3 d91fg1 a133fg3 a101fg1 d141fg3 a5fg1 a1fg1 a61fg3 d98fg1 a137fg3 disk (0x7fbf0787aac0), num 116a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 116 [13]: d136fg3 d138fg3 a84fg1 a128fg3 d98fg1 a141fg3 d97fg1 a145fg1 a135fg3 a130fg3 
a21fg1 d100fg1 a129fg3 disk (0x7fbf07879908), num 117a slot 65535 fg 2 ptotal 14 pact 8 pnew 0 pdrp 6 pset dsk 117 [14]: d91fg1 d134fg3 d135fg3 d94fg1 d125fg3 a74fg1 a73fg1 a140fg3 a101fg1 a60fg3 d85fg1 a35fg3 a17fg3 a97fg1 disk (0x7fbf07879970), num 118a slot 65535 fg 2 ptotal 13 pact 8 pnew 0 pdrp 5 pset dsk 118 [13]: d130fg3 d131fg3 d140fg3 a44fg1 d72fg1 a38fg1 a139fg3 a95fg1 a65fg3 d143fg3 a84fg1 a61fg3 a34fg3 disk (0x7fbf078799d8), num 119i slot 65535 fg 2 ptotal 8 pact 0 pnew 0 pdrp 8 pset dsk 119 [8]: d130fg3 d95fg1 d125fg3 d84fg1 d126fg3 d129fg3 d88fg1 d85fg1 disk (0x7fbf07879a40), num 120a slot 65535 fg 2 ptotal 20 pact 7 pnew 0 pdrp 13 pset dsk 120 [20]: d128fg3 d89fg1 d138fg3 d97fg1 d98fg1 d124fg3 d142fg3 d145fg1 d35fg3 d45fg1 a88fg1 d33fg3 d40fg1 a96fg1 a46fg1 a20fg1 a12fg3 a147fg1 d141fg3 a78fg3 disk (0x7fbf07879aa8), num 121a slot 65535 fg 2 ptotal 16 pact 8 pnew 0 pdrp 8 pset dsk 121 [16]: a126fg3 d85fg1 a132fg3 d91fg1 d133fg3 d92fg1 a146fg1 d16fg3 a47fg1 a141fg3 a138fg3 d12fg3 a82fg1 d142fg3 d136fg3 a69fg3 disk (0x7fbf07878888), num 122a slot 65535 fg 2 ptotal 15 pact 8 pnew 0 pdrp 7 pset dsk 122 [15]: d124fg3 a123fg3 d82fg1 d83fg1 d127fg3 d86fg1 a128fg3 a0fg1 a78fg3 a70fg3 a84fg1 d100fg1 d96fg1 a42fg1 a39fg1 disk (0x7fbf078788f0), num 123a slot 65535 fg 3 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 123 [11]: d114fg2 a122fg2 d98fg1 a96fg1 a112fg2 d91fg1 a108fg2 a25fg2 a146fg1 a50fg2 a86fg1 disk (0x7fbf07877808), num 124a slot 65535 fg 3 ptotal 10 pact 8 pnew 0 pdrp 2 ...... ...... 
disk (0x7fbf07872658), num 145a slot 65535 fg 1 ptotal 13 pact 7 pnew 1 pdrp 5 pset dsk 145 [13]: a34fg3 d49fg2 d17fg3 d138fg3 d120fg2 a136fg3 a116fg2 a137fg3 d134fg3 a87fg2 a93fg2 a56fg2 n143fg3 disk (0x7fbf07871508), num 146a slot 65535 fg 1 ptotal 11 pact 8 pnew 0 pdrp 3 pset dsk 146 [11]: a79fg3 d77fg2 d35fg3 a52fg2 a55fg2 a75fg2 a131fg3 a121fg2 a123fg3 d106fg2 a32fg3 disk (0x7fbf07871570), num 147a slot 65535 fg 1 ptotal 14 pact 8 pnew 0 pdrp 6 pset dsk 147 [14]: d57fg2 d69fg3 d34fg3 a128fg3 d110fg2 d78fg3 d76fg2 a138fg3 a55fg2 a31fg3 a120fg2 a16fg3 a65fg3 a109fg2 fail (0x7fbf0ce71398), name MPC2C1 num 1 size 48 act 48 new 0 drp 0 au 20889600 ptotal 566 pact 380 pnew 4 pdrp 182 rtotal 2 ract 2 rnew 0 rdrp 0 fset (0x7fbf0ce716a8), fg: 1, tot: 2 frel (0x7fbf077102c0), fg:<1 2>, totaldp:294 actdp 191 newdp 1 drpdp 102, st A frel (0x7fbf076faf08), fg:<1 3>, totaldp:272 actdp 189 newdp 3 drpdp 80, st A disks: 0 1 2 3 4 5 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 82 83 84 85 86 88 89 90 91 92 94 95 96 97 98 100 101 144 145 146 147 fail (0x7fbf0770f860), name MPC2C2 num 2 size 52 act 48 new 0 drp 4 au 20889600 ptotal 612 pact 381 pnew 2 pdrp 229 rtotal 2 ract 2 rnew 0 rdrp 0 fset (0x7fbf07710570), fg: 2, tot: 2 frel (0x7fbf077102c0), fg:<1 2>, totaldp:294 actdp 191 newdp 1 drpdp 102, st A frel (0x7fbf076fae48), fg:<2 3>, totaldp:318 actdp 190 newdp 1 drpdp 127, st A disks: 6 7 8 9 10 11 24 25 26 27 28 29 48 49 50 51 52 53 54 55 56 57 58 59 75 76 77 81 87 93 99 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 fail (0x7fbf076fa788), name MPC2C3 num 3 size 48 act 48 new 0 drp 0 au 20889600 ptotal 590 pact 379 pnew 4 pdrp 207 rtotal 2 ract 2 rnew 0 rdrp 0 fset (0x7fbf076fb158), fg: 3, tot: 2 frel (0x7fbf076faf08), fg:<1 3>, totaldp:272 actdp 189 newdp 3 drpdp 80, st A frel (0x7fbf076fae48), fg:<2 3>, totaldp:318 actdp 190 newdp 1 drpdp 127, st A disks: 12 13 14 15 16 17 30 31 32 33 34 35 60 61 62 63 64 65 66 67 68 
69 70 71 78 79 80 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 cset (0x7fbf0ce71358), total frels: 3 frel (0x7fbf077102c0), fg:<1 2>, totaldp:294 actdp 191 newdp 1 drpdp 102, st A frel (0x7fbf076faf08), fg:<1 3>, totaldp:272 actdp 189 newdp 3 drpdp 80, st A frel (0x7fbf076fae48), fg:<2 3>, totaldp:318 actdp 190 newdp 1 drpdp 127, st A kfdp_query: callcnt 1721983 grp 1 (DATAC1) NOTE: GroupBlock outside rolling migration privileged region ----- Abridged Call Stack Trace ----- ksedsts()+426<-kfnmGroupBlockGlobal()+659<-kfnmGroupBlockPriv()+318<-kfgFinalize()+334<-kfxdrvAlter()+3415<-kfxdrvEntry()+1417<-opiexe()+28735<-opiosq0()+4494<-kpooprx()+387<-kpoal8()+830<-opiodr()+1202<-ttcpip()+1222<-opitsk()+1903<-opiino()+936<-opiodr()+1202 <-opidrv()+1094<-sou2o()+165<-opimai_real()+422<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245 ----- End of Abridged Call Stack Trace ----- Partial short call stack signature: 0xb0ac14de6c5e2e9c SQL> alter diskgroup DATAC1 rebalance power 6 kfgpCreate: max_fg_rel 4, max_disk_part 8 kfgpPartners: NOT appliance. kfgpPartners: max_fg_rel, max_disk_part(4, 8) has been adjusted to (3, 8) due to actual FG, disk configuration (3, 144, num_singledisk_fg 0) kfgpPartners: verifying consistency of newly formed partners. kfgpPartners: repartnering completed. kfgpGet: insufficient space provided by caller. size 21, pcnt 20, KFPTNR_MAXTOT 20 WARNING: Too many uncompleted reconfigurations. Rebalance needs completion. kfgp (0x7fb69d5f2910), allow quorum: 0, total disks: 148, FGs: total 3 active 3 normal 3 active quorum 0, max dsknum: 147, maxfgnum: 3 scores=55296 ties=9696 add=576 insert=0 replace=0 disk (0x7fb69d5f1c90), num 0a slot 65535 fg 1 ptotal 8 pact 0 pnew 8 pdrp 0
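Judging from the first trace, Oracle ASM was reporting a problem with the PST partner information of the affected disk, so we used event 15195 at level 0x39 to rebuild the PST partner relationships. That did not solve the problem; the next attempt reported kfgpGet: insufficient space provided by caller. size 21, pcnt 20, KFPTNR_MAXTOT 20.
To investigate, I tried to reproduce the problem in our internal test environment: repeatedly offlining and dropping disks, then adding disks, and changing the rebalance power several times while those disk operations were in progress. After at least 10 attempts, I finally hit an error I had not seen before: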
SQL> alter diskgroup dg_data01 drop disk DG_DATA01_0135 force
2021-07-16T17:04:25.492886+08:00
NOTE: cache closing disk 139 of grp 1: (not open) _DROPPED_0139_DG_DATA01
NOTE: GroupBlock outside rolling migration privileged region
NOTE: full repartnering enabled for group 1 by test event 15195 level 0x39
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_160232.trc (incident=40010):
ORA-00600: internal error code, arguments: [kfgCanRepartner01], [2], [3], [6], [], [], [], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_40010/+ASM1_ora_160232_i40010.trc
2021-07-16T17:04:26.506682+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-07-16T17:04:26.506873+08:00
ORA-00600: internal error code, arguments: [kfgCanRepartner01], [2], [3], [6], [], [], [], [], [], [], [], []
2021-07-16T17:04:26.506954+08:00
ERROR: alter diskgroup dg_data01 drop disk DG_DATA01_0135 force
2021-07-16T17:04:26.509210+08:00
SQL> alter diskgroup dg_data01 drop disk DG_DATA01_0134 force
2021-07-16T17:04:26.509917+08:00
NOTE: cache closing disk 139 of grp 1: (not open) _DROPPED_0139_DG_DATA01
NOTE: GroupBlock outside rolling migration privileged region
NOTE: full repartnering enabled for group 1 by test event 15195 level 0x39
2021-07-16T17:04:26.584378+08:00
Dumping diagnostic data in directory=[cdmp_20210716170426], requested by (instance=1, osid=160232), summary=[incident=40010].
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_160232.trc (incident=40011):
ORA-00600: internal error code, arguments: [kfgCanRepartner01], [3], [1], [9], [], [], [], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_40011/+ASM1_ora_160232_i40011.trc
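I had never encountered this error before. It suggests that ASM management in Oracle 19c still has some rough edges: frequent disk drop and add operations before a rebalance has completed can trigger problems. That said, judging from the tests, the ASM checks in 19c are more thorough and somewhat more robust than in 11.2.0.4.
Back to the customer case. While we were stuck, one night the disks of one storage node in the customer's environment were taken offline; after they were brought back online, the diskgroup rebalance unexpectedly ran normally again. I traced this further. The disks involved in that offline event were: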
*** 2021-07-23T05:41:27.626857+08:00
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 82/0x0, mask = 0x7f, op = assign mandatory
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 88/0x0, mask = 0x7f, op = assign mandatory
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 94/0x0, mask = 0x7f, op = assign mandatory
NOTE: initiating PST update: grp 1 (DATAC1), dsk = 100/0x0, mask = 0x7f, op = assign mandatory
kfdp_updateDsk(): callcnt 1766027 grp 1
PST verChk -0: req, id=3197182789, grp=1, requested=146 at 07/23/2021 05:41:27
NOTE: PST update grp = 1 completed successfully
NOTE: kfdsFilter_freeDskSrSlice for Filter 0x7ff72009cfd0
NOTE: kfdsFilter_clearDskSlice for Filter 0x7ff72009cfd0 (all:TRUE)
NOTE: completed online of disk group 1 disks
DATAC1_0082 (82)
DATAC1_0088 (88)
DATAC1_0094 (94)
DATAC1_0100 (100)
ARB0 relocating file +DATAC1.1.1 reason 6 (1 entries first xnum 0x1)
ARB0 relocating file +DATAC1.3.1 reason 6 (9 entries first xnum 0x3)
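Four disks were involved: 82, 88, 94 and 100. From the earlier trace we know that rebalance was stuck on disk 120, and that Oracle reported the PST partner slots of that disk had reached the maximum. Analysis with kfed confirmed that this structure holds at most 20 entries.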
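To spot disks like 120 without reading the whole kfgp dump by eye, the per-disk counters can be pulled out with a short script. The following is only a sketch: the regular expression assumes the dump layout seen in this trace, the function names and SAMPLE are illustrative, and the limit of 20 is taken from the KFPTNR_MAXTOT value printed in the trace. It is not an Oracle-provided tool.

```python
import re

# Partner-slot ceiling, as printed in the trace ("KFPTNR_MAXTOT 20").
KFPTNR_MAXTOT = 20

# Matches the counters of one "disk (...)" entry in the kfgp dump, e.g.
# "num 120a slot 65535 fg 2 ptotal 20 pact 7 pnew 0 pdrp 13".
# \s+ is used so that entries wrapped across lines still match.
DISK_RE = re.compile(
    r"num\s+(?P<num>\d+)[a-z]\s+slot\s+\d+\s+fg\s+(?P<fg>\d+)\s+"
    r"ptotal\s+(?P<ptotal>\d+)\s+pact\s+(?P<pact>\d+)\s+"
    r"pnew\s+(?P<pnew>\d+)\s+pdrp\s+(?P<pdrp>\d+)"
)


def partner_counts(trace_text):
    """Return {disk_number: {fg, ptotal, pact, pnew, pdrp}} for each disk entry."""
    result = {}
    for match in DISK_RE.finditer(trace_text):
        fields = {key: int(value) for key, value in match.groupdict().items()}
        result[fields.pop("num")] = fields
    return result


def saturated_disks(trace_text, limit=KFPTNR_MAXTOT):
    """Return the sorted disk numbers whose ptotal has reached the limit."""
    counts = partner_counts(trace_text)
    return sorted(disk for disk, c in counts.items() if c["ptotal"] >= limit)


# Three entries copied from the trace above (pset lists omitted).
SAMPLE = """
disk (0x7fbf07880290), num 97a slot 65535 fg 1 ptotal 18 pact 8 pnew 0 pdrp 10
disk (0x7fbf078799d8), num 119i slot 65535 fg 2 ptotal 8 pact 0 pnew 0 pdrp 8
disk (0x7fbf07879a40), num 120a slot 65535 fg 2 ptotal 20 pact 7 pnew 0 pdrp 13
"""

print(saturated_disks(SAMPLE))  # [120]
```

In these entries ptotal equals pact + pnew + pdrp, so disk 120 is full mostly because of 13 partners still marked as being dropped, which fits the "Too many uncompleted reconfigurations" warning.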
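So why did the diskgroup rebalance return to normal after those four disks happened to be offlined and then onlined?
Our analysis showed that one of the four offlined disks, disk 88, is a partner of disk 120. We believe that the offline operation ultimately caused Oracle to skip the consistency check on disk 120.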
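The partner relationship can be checked directly against the pset line of disk 120 in the trace. The sketch below parses that line and intersects it with the four offlined disks. The reading of the prefix letters (a = active, d = being dropped, n = new) is an inference from the matching pact/pdrp/pnew counts, not documented behavior, and the function and variable names are illustrative.

```python
import re

# One partner token such as "a88fg1": state letter, disk number, failgroup.
PARTNER_RE = re.compile(r"\b([adn])(\d+)fg(\d+)\b")


def parse_pset(pset_line):
    """Return a list of (state, disk, failgroup) tuples from a pset line."""
    body = pset_line.split(":", 1)[1]  # drop the "pset dsk N [count]" prefix
    return [(s, int(d), int(fg)) for s, d, fg in PARTNER_RE.findall(body)]


# pset line of disk 120, copied from the trace above.
DISK_120_PSET = (
    "pset dsk 120 [20]: d128fg3 d89fg1 d138fg3 d97fg1 d98fg1 d124fg3 "
    "d142fg3 d145fg1 d35fg3 d45fg1 a88fg1 d33fg3 d40fg1 a96fg1 a46fg1 "
    "a20fg1 a12fg3 a147fg1 d141fg3 a78fg3"
)

# Disks that went offline and were onlined again on 2021-07-23.
OFFLINED = {82, 88, 94, 100}

partners = parse_pset(DISK_120_PSET)
hits = [p for p in partners if p[1] in OFFLINED]
print(hits)  # [('a', 88, 1)]
```

Of the four disks, only 88 appears in the partner set of disk 120, and it appears as one of the seven entries with the a prefix.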
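Seen in this light, the workarounds we had previously proposed to the customer are also consistent with this finding:
1. Offline disk 120, then online it (an offline/online cycle within disk repair time does not trigger a rebalance).
2. Drop disk 120 with force, then run the rebalance manually.
This case was fairly interesting, so I am recording it briefly here. What made it unusual is that the diskgroup is large, about 250 TB, so every operation had to be handled with caution.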