Oracle ASM Recovery Case:
The database alert log reads as follows:
Tue Mar 04 18:09:59 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lgwr_15943.trc:
ORA-00345: redo log write error block 68145 count 5
ORA-00312: online log 6 thread 2: '+DATA/xxxx/onlinelog/o2_t2_redo3.log'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:09:59 CST 2014
SUCCESS: diskgroup DATA was dismounted
SUCCESS: diskgroup DATA was dismounted
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lmon_15892.trc:
ORA-00202: control file: '+DATA/xxxx/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x1f41e online=1
     file=31 +DATA/xxxx/datafile/apps_ts_queues.310.692585175
     error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5d online=1
     file=51 +DATA/xxx/datafile/apps_ts_tx_data.353.692593409
     error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
Errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_dbw2_15939.trc:
ORA-00202: control file: '+DATA/prod/controlfile/o1_mf_4g1zr1yo_.ctl'
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Mar 04 18:10:00 CST 2014
KCF: write/open error block=0x47d5b online=1
     file=51 +DATA/prod/datafile/apps_ts_tx_data.353.692593409
     error=15078 txt: ''
Tue Mar 04 18:10:00 CST 2014
The ASM instance alert log shows the following:
NOTE: SMON starting instance recovery for group 1 (mounted)
Tue Mar 04 18:10:04 CST 2014
WARNING: IO Failed. au:0 diskname:/dev/raw/raw5
         rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0 status:2
WARNING: IO Failed. au:0 diskname:/dev/raw/raw5
         rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0 status:2
NOTE: F1X0 found on disk 0 fcn 0.160230519
WARNING: IO Failed. au:33 diskname:/dev/raw/raw5
         rq:0x60000000002d64f0 buffer:0x400405df000 au_offset(bytes):0 iosz:4096 operation:0 status:2
WARNING: cache failed to read gn 1 fn 3 blk 10752 count 1 from disk 2
ERROR: cache failed to read fn=3 blk=10752 from disk(s): 2
ORA-15081: failed to submit an I/O operation to a disk
NOTE: cache initiating offline of disk 2 group 1
WARNING: process 12863 initiating offline of disk 2.2526420198 (DATA_0002) with mask 0x3 in group 1
NOTE: PST update: grp = 1, dsk = 2, mode = 0x6
Tue Mar 04 18:10:04 CST 2014
ERROR: too many offline disks in PST (grp 1)
Tue Mar 04 18:10:04 CST 2014
ERROR: PST-initiated MANDATORY DISMOUNT of group DATA
Tue Mar 04 18:10:04 CST 2014
WARNING: Disk 2 in group 1 in mode: 0x7,state: 0x2 was taken offline
Tue Mar 04 18:10:05 CST 2014
NOTE: halting all I/Os to diskgroup DATA
NOTE: active pin found: 0x0x40045bb0fd0
Tue Mar 04 18:10:05 CST 2014
Abort recovery for domain 1
Tue Mar 04 18:10:05 CST 2014
NOTE: cache dismounting group 1/0xD916EC16 (DATA)
Tue Mar 04 18:10:06 CST 2014
Reading the suspect disk shows a corrupted ASM metadata block (KFBTYP_INVALID):
WARNING: cache read a corrupted block gn=1 dsk=2 blk=1 from disk 2
OSM metadata block dump:
kfbh.endian:            0 ; 0x000: 0x00
kfbh.hard:              0 ; 0x001: 0x00
kfbh.type:              0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:            0 ; 0x003: 0x00
kfbh.block.blk:         0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:         0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check:             0 ; 0x00c: 0x00000000
kfbh.fcn.base:          0 ; 0x010: 0x00000000
kfbh.fcn.wrap:          0 ; 0x014: 0x00000000
kfbh.spare1:            0 ; 0x018: 0x00000000
kfbh.spare2:            0 ; 0x01c: 0x00000000
CE: (0x0x400417ee4e0) group=1 (DATA) obj=2 (disk) blk=1
    hashFlags=0x0002 lid=0x0002 lruFlags=0x0000 bastCount=1
    redundancy=0x11 fileExtent=-2147483648 AUindex=0 blockIndex=1
    copy #0: disk=2 au=0
BH: (0x0x40041795000) bnum=4586 type=reading state=reading
    chgSt=not modifying flags=0x00000000 pinmode=excl lockmode=share
    bf=0x0x40041400000 kfbh_kfcbh.fcn_kfbh = 0.0 lowAba=655.8572 highAba=0.0
    last kfcbInitSlot return code=null cpkt lnk is null
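kfed addresses a metadata block by allocation-unit number (aun) and block number (blkn). When locating the same block with dd or a hex editor, the byte offset on disk can be computed from those two numbers. A minimal sketch, assuming the common defaults of a 1 MB allocation unit and 4 KB metadata block size (verify kfdhdb.ausize and kfdhdb.blksize on your own diskgroup first):

```shell
#!/bin/sh
# Byte offset of an ASM metadata block, assuming default sizes.
AU_SIZE=1048576     # kfdhdb.ausize  (default 1 MB)
BLK_SIZE=4096       # kfdhdb.blksize (default 4 KB)

asm_block_offset() {
    aun=$1; blkn=$2
    echo $(( aun * AU_SIZE + blkn * BLK_SIZE ))
}

# The corrupted block above is blk=1 of au=0 on disk 2:
asm_block_offset 0 1    # -> 4096
# A block addressed as aun=1 blkn=2 (e.g. in a "kfed read ... aun=1 blkn=2"):
asm_block_offset 1 2    # -> 1056768
```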
Finally, we used our self-developed database extraction tool to recover all of the Oracle data files from the dismounted diskgroup.
Distributed storage failure:
Tue Mar 03 11:40:49 2020
ERROR: too many offline disks in PST (grp 1)
Tue Mar 03 11:40:49 2020
......
NOTE: cache closing disk 122 of grp 1: (not open) DG_DATA01_0122
ERROR: disk 35(DG_DATA01_0035) in group 1(DG_DATA01) cannot be offlined
because all disks [35(DG_DATA01_0035), 83(DG_DATA01_0083)] with mirrored data would be offline.
Tue Mar 03 11:40:49 2020
ERROR: too many offline disks in PST (grp 1)
Tue Mar 03 11:40:49 2020
NOTE: cache dismounting (not clean) group 1/0xFF593AAC (DG_DATA01)
NOTE: messaging CKPT to quiesce pins Unix process pid: 94856, image: oracle@zhjqc01 (B000)
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 0 of grp 1: (not open) DG_DATA01_0000
Tue Mar 03 11:40:49 2020
NOTE: halting all I/Os to diskgroup 1 (DG_DATA01)
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 80 of grp 1: (not open) DG_DATA01_0080
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 82 of grp 1: (not open) DG_DATA01_0082
Tue Mar 03 11:40:49 2020
WARNING: Offline of disk 35 (DG_DATA01_0035) in group 1 and mode 0x7f failed on ASM inst 1
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 83 of grp 1: (not open) DG_DATA01_0083
Tue Mar 03 11:40:50 2020
......
NOTE: LGWR doing non-clean dismount of group 1 (DG_DATA01) thread 2
NOTE: LGWR sync ABA=583.8239 last written ABA 583.8239
Tue Mar 03 11:40:50 2020
NOTE: initiating dirty detach from lock domain 1
Tue Mar 03 11:40:50 2020
.......
Tue Mar 03 11:40:50 2020
Dirty Detach Reconfiguration complete (total time 0.3 secs)
ASM Health Checker found 1 new failures
Tue Mar 03 11:40:51 2020
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xff593aac (DG_DATA01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Tue Mar 03 11:40:51 2020
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_55928.trc:
ORA-15130: diskgroup "DG_DATA01" is being dismounted
Tue Mar 03 11:40:54 2020
Manually mounting the diskgroup fails with the following errors:
ERROR: diskgroup DG_DATA01 was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "35" in group "DG_DATA01" may result in a data loss
ORA-15042: ASM disk "35" is missing from group number "1"
ORA-15080: synchronous I/O operation failed to read block 0 of disk 13 in disk group
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: 4294967295
Additional information: 4096
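Before touching any metadata, it is worth decoding the OS-level detail. In our reading, ORA-27061's first "Additional information" line is the return value of the failed I/O call and the second is the requested transfer size: 4294967295 is -1 printed as an unsigned 32-bit integer (i.e. the read itself failed), and 4096 is the one metadata block that was requested. A quick sanity check of that interpretation:

```shell
#!/bin/sh
# Reinterpret an unsigned 32-bit value as the signed int the OS
# call originally returned (assumption: the value is the return
# code of the failed read).
to_int32() {
    if [ "$1" -gt 2147483647 ]; then
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

to_int32 4294967295   # -> -1 (the I/O call failed outright)
to_int32 4096         # -> 4096 (the transfer size, unchanged)
```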
We fixed the error manually with kfed; the fix was first verified in a test environment, as follows:
SQL> alter diskgroup test mount force;
alter diskgroup test mount force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "1" in group "TEST" may result in a data loss
ORA-15042: ASM disk "1" is missing from group number "2"

=============== PST ====================
grpNum:    2
state:     2
callCnt:   91
(lockvalue) valid=1 ver=0.0 ndisks=2 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next:            46
last:            46
pst count:       2
pst locations:   2  0
incarn:          45
dta size:        6
version:         1
ASM version:     186646528 = 11.2.0.0.0
contenttype:     1
partnering pattern: [ ]
--------------- LOC MAP ----------------
0: dirty 0  cur_loc: 0  stable_loc: 0
1: dirty 1  cur_loc: 1  stable_loc: 1
--------------- DTA --------------------
0: sts v v(rw) p(rw) a(x) d(x) fg# = 1 addTs = 2429200834 parts: 5 (amp) 4 (amp) 3 (amp) 2 (amp)
1: sts v v(-w) p(-w) a(-) d(-) fg# = 1 addTs = 2429200834 parts: 4 (amp) 5 (amp) 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 1 (amp) 4 (amp) 5 (amp) 0 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 5 (amp) 0 (amp) 1 (amp) 4 (amp)
4: sts v v(--) p(--) a(-) d(-) fg# = 3 addTs = 2429203972 parts: 1 (amp) 0 (amp) 2 (amp) 3 (amp)
5: sts v v(--) p(--) a(-) d(-) fg# = 3 addTs = 2429203972 parts: 0 (amp) 1 (amp) 3 (amp) 2 (amp)

[grid@Prod ~]$ kfed read /dev/asm-diske aun=1 blkn=2 | grep "status" | grep -v "I=0"
kfdpDtaEv1[0].status: 127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status: 127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status: 127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status: 127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:  21 ; 0x0c0: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[5].status:  21 ; 0x0f0: I=1 V=0 V=1 P=0 P=1 A=0 D=0

[grid@Prod ~]$ kfed read /dev/asm-diskc aun=1 blkn=2 | grep "status" | grep -v "I=0"
kfdpDtaEv1[0].status: 127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status: 127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status: 127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status: 127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:  21 ; 0x0c0: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[5].status:  21 ; 0x0f0: I=1 V=0 V=1 P=0 P=1 A=0 D=0

[grid@test ~]$ kfed read /dev/asm-diskh aun=1 blkn=2 | grep "status" | grep -v "I=0"
kfdpDtaEv1[0].status: 127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status: 127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status: 127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status: 127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status: 127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status: 127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1

[grid@test ~]$ kfed read /dev/asm-diskh aun=1 blkn=2 | grep "status" | grep -v "I=0" > repair.txt
[grid@test ~]$ kfed merge /dev/asm-diske aun=1 blkn=2 text=repair.txt
[grid@test ~]$ kfed merge /dev/asm-diskc aun=1 blkn=2 text=repair.txt
[grid@test ~]$ kfed merge /dev/asm-diske aun=1 blkn=3 text=repair.txt
[grid@test ~]$ kfed merge /dev/asm-diskc aun=1 blkn=3 text=repair.txt

SQL> alter diskgroup test mount force;

Diskgroup altered.
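The repair works because each kfdpDtaEv1[n].status value is a small per-disk bitmask, which kfed prints both as a decimal number and as decoded I/V/P/A/D flags. In our reading of the dumps above, the flags map to the low seven bits of the value (bit 0 = I upward), so 127 means every flag is set, while 21 marks an entry whose valid/partner/alive/discovered bits have been cleared. A small sketch of that decoding (the bit order is inferred from the dumps, not taken from official documentation):

```shell
#!/bin/sh
# Decode a kfdpDtaEv1[n].status value into the flag string kfed
# prints. Assumed bit order (low to high): I, V, V, P, P, A, D.
decode_status() {
    s=$1
    printf 'I=%d V=%d V=%d P=%d P=%d A=%d D=%d\n' \
        $(( s      & 1 )) $(( s >> 1 & 1 )) $(( s >> 2 & 1 )) \
        $(( s >> 3 & 1 )) $(( s >> 4 & 1 )) $(( s >> 5 & 1 )) \
        $(( s >> 6 & 1 ))
}

decode_status 127   # healthy entry: all flags set
decode_status 21    # damaged entry: V/P/A/D partially cleared
```

This is why copying the status lines from the intact disk (/dev/asm-diskh) into the damaged PST copies with kfed merge restores a mountable state.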
Storage array damage:
Mon Apr 27 02:01:15 2020
Read of datafile '+DATADG/orclsk/datafile/sysaux.260.956077231' (fno 2) header failed with ORA-01207
Rereading datafile 2 header from mirror side 'DATADG_0000' failed with ORA-01207
Rereading datafile 2 header from mirror side 'DATADG_0006' failed with ORA-01207
Errors in file /u01/app/oracle/diag/rdbms/orclsk/orclsk1/trace/orclsk1_ckpt_4893.trc:
ORA-63999: data file suffered media failure
ORA-01122: database file 2 failed verification check
ORA-01110: data file 2: '+DATADG/orclsk/datafile/sysaux.260.956077231'
ORA-01207: file is more recent than control file - old control file
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ckpt_4893.trc:
ORA-63999: data file suffered media failure
ORA-01122: database file 2 failed verification check
ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231'
ORA-01207: file is more recent than control file - old control file
CKPT (ospid: 4893): terminating the instance due to error 63999
The customer restarted the database environment and found that a large number of errors were still being reported:
Mon Apr 27 02:02:03 2020
Read of datafile '+DATADG/xxxxx/datafile/mscsz.dbf' (fno 40) header failed with ORA-01207
Rereading datafile 40 header from mirror side 'DATADG_0000' failed with ORA-01207
Rereading datafile 40 header from mirror side 'DATADG_0005' failed with ORA-01207
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_dbw0_115709.trc:
ORA-01186: file 40 failed verification tests
ORA-01122: database file 40 failed verification check
ORA-01110: data file 40: '+DATADG/xxxxx/datafile/mscsz.dbf'
ORA-01207: file is more recent than control file - old control file
File 40 not verified due to error ORA-01122
.......
Mon Apr 27 02:02:07 2020
QMNC started with pid=46, OS id=115887
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_115771.trc (incident=299881):
ORA-01578: ORACLE data block corrupted (file # 1, block # 3620)
ORA-01110: data file 1: '+DATADG/xxxxx/datafile/system.259.956077225'
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_299881/xxxxx1_ora_115771_i299881.trc
Mon Apr 27 02:02:16 2020
Dumping diagnostic data in directory=[cdmp_20200427020216], requested by (instance=1, osid=115771), summary=[incident=299881].
Mon Apr 27 02:02:17 2020
Sweep [inc][299881]: completed
Mon Apr 27 02:02:18 2020
Completed: ALTER DATABASE OPEN /* db agent *//* {0:13:1899} */
Mon Apr 27 02:02:18 2020
ORA-21700 encountered when generating server alert SMG-4120
ORA-21700 encountered when generating server alert SMG-4121
ORA-21700 encountered when generating server alert SMG-4121
Mon Apr 27 02:02:20 2020
Starting background process CJQ0
Mon Apr 27 02:02:20 2020
CJQ0 started with pid=54, OS id=115990
Mon Apr 27 02:02:20 2020
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc (incident=300025):
ORA-01578: ORACLE data block corrupted (file # 2, block # 101561)
ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231'
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300025/xxxxx1_m004_115984_i300025.trc
Dumping diagnostic data in directory=[cdmp_20200427020222], requested by (instance=1, osid=115984 (M004)), summary=[incident=300025].
Mon Apr 27 02:02:22 2020
Sweep [inc][300025]: completed
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc (incident=300026):
ORA-01578: ORACLE data block corrupted (file # , block # )
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300026/xxxxx1_m004_115984_i300026.trc
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc (incident=300027):
ORA-01578: ORACLE data block corrupted (file # , block # )
ORA-06512: at "SYS.PRVT_ADVISOR", line 6144
ORA-06512: at line 1
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300027/xxxxx1_m004_115984_i300027.trc
Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300026].
Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300027].
Mon Apr 27 02:02:46 2020
Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc
Corrupt block relative dba: 0x0040613f (file 1, block 24895)
Bad header found during buffer read
Data in bad block:
 type: 6 format: 2 rdba: 0x00405c3f
 last change scn: 0x0000.0002a412 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0xa4120601
 check value in block header: 0xa486
 computed block checksum: 0x0
Reading datafile '+DATADG/xxxxx/datafile/system.259.956077225' for corruption at rdba: 0x0040613f (file 1, block 24895)
Read datafile mirror 'DATADG_0001' (file 1, block 24895) found same corrupt data (no logical check)
Read datafile mirror 'DATADG_0003' (file 1, block 24895) found valid data
Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc
Repaired corruption at (file 1, block 24895)
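The block addresses in the dump can be decoded by hand: a 32-bit data block address (dba) packs the relative file number into its top 10 bits and the block number into the low 22 bits. A minimal sketch, which reproduces the "(file 1, block 24895)" annotation from the trace:

```shell
#!/bin/sh
# Decode an Oracle data block address (DBA):
# file# = top 10 bits, block# = low 22 bits.
decode_dba() {
    printf 'file %d, block %d\n' $(( $1 >> 22 )) $(( $1 & 0x3FFFFF ))
}

decode_dba 0x0040613f   # -> file 1, block 24895 (the corrupt block)
decode_dba 0x00405c3f   # -> file 1, block 23615 (the stale rdba found in its header)
```

The mismatch between the block's own rdba (0x00405c3f) and the address it was read from (0x0040613f) is exactly the "Bad header" condition; since mirror DATADG_0003 still held valid data, Oracle repaired the block automatically.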
Finally, after the corrupt blocks were repaired, we restored this business system to normal operation.