未格式化的坏块导致数据库备份失败

今天一位同事咨询说某客户一套核心数据库出现坏块,导致NBU备份失败;他通过rman进行blockrecover 报错,如下:

---rman validate检查报错坏块
SQL> SELECT * FROM v$database_block_corruption;

     FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
        15    1185624          8                  0 CORRUPT
        15    1185652          4                  0 CORRUPT
        15    1187096          4                  0 CORRUPT

---通过blockrecover恢复报错

RMAN> run{
2> ALLOCATE CHANNEL ch00 TYPE 'SBT_TAPE';
3> SEND 'NB_ORA_SERV=nbu_master, NB_ORA_CLIENT=ybbred12';
4> blockrecover datafile 15 block 1187096;
5> RELEASE CHANNEL ch00;
6> }

using target database control file instead of recovery catalog
allocated channel: ch00
channel ch00: SID=1618 instance=billog2 device type=SBT_TAPE
channel ch00: Veritas NetBackup for Oracle - Release 7.7.3 (2016051920)

sent command to channel: ch00

Starting recover at 2020-03-19 11:07:12

channel ch00: restoring block(s)
channel ch00: specifying block(s) to restore from backup set
restoring blocks of datafile 00015
channel ch00: reading from backup piece bk_115329_1_1035085787
channel ch00: ORA-19870: error while restoring backup piece bk_115329_1_1035085787
ORA-19507: failed to retrieve sequential file, handle="bk_115329_1_1035085787", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
   Backup file <bk_115329_1_1035085787> not found in NetBackup catalog
.......

从上述报错来看,rman提示找不到备份集合。对NBU有一点点经验;通过bplist查看最近几天对备份确实没有。

ybbred12[/]#bplist -l -C byyred12 -t 4 -s  03/15/2020 00:00:00  -e 03/15/2020 23:00:00  -R /
EXIT STATUS 227: no entity was found
ybbred12[/]#bplist -l -C byyred12 -t 4 -s  03/15/2020 00:00:00  -e 03/18/2020 23:00:00  -R /
EXIT STATUS 227: no entity was found

然后我对数据块进行了dump发现内容完全不对。

*** 2020-03-19 12:17:33.639
Start dump data block from file +DATA/billog/datafile/tbs_cen_ilog.279.907377523 minblk 1185625 maxblk 1185625
 V10 STYLE FILE HEADER:
        Compatibility Vsn = 186646528=0xb200000
        Db ID=3752648572=0xdfacdf7c, Db Name='BILLOG'
        Activation ID=0=0x0
        Control Seq=24409607=0x1747607, File size=4188672=0x3fea00
        File Number=15, Blksiz=8192, File Type=3 DATA
Dump all the blocks in range:
Encrypted block <11, 3135232634> content will not be dumped. Dumping header only.
buffer tsn: 11 rdba: 0xbadfda7a (747/2087546)
scn: 0xbadf.badfda7a seq: 0xda flg: 0x7a tail: 0xbadfda7a
frmt: 0x07 chkval: 0xbadf type: 0xba=unknown
Hex dump of corrupt header 4 = CORRUPT
Dump of memory from 0x0000000110860E00 to 0x0000000110860E14
110860E00 BADFDA7A BADFDA7A BADFDA7A BADFDA7A  [...z...z...z...z]
110860E10 BADFDA7A                             [...z]
End dump data block from file +DATA/billog/datafile/tbs_cen_ilog.279.907377523 minblk 1185625 maxblk 1185625

因此我怀疑这几个坏块可能是未格式化的坏块。进一步通过查询dba_extents进行了确认:

SQL> set lines 200
SQL> col segment_name for a30
SQL> col segment_type for a20
SQL> col owner for a20
SQL> select segment_name,segment_type,owner
  2  from dba_Extents
  3  where file_id=15
  4  and 1185624 between block_id and block_id + blocks -1;

no rows selected

SQL> c/5624/5662
  4* and 1185662 between block_id and block_id + blocks -1
SQL> /

no rows selected

SQL> c/5662/7100
  4* and 118877100 between block_id and block_id + blocks -1
SQL> /

no rows selected

SQL> select * from (select owner,segment_name,file_id,extent_id,block_id,blocks from
  2  dba_extents where file_id=15 and block_id > 1185000) where rownum < 20;

no rows selected

SQL> c/1185000/1170000
  2* dba_extents where file_id=15 and block_id > 1170000) where rownum < 20
SQL>
SQL> l
  1  select * from (select owner,segment_name,file_id,extent_id,block_id,blocks from
  2* dba_extents where file_id=15 and block_id > 1170000) where rownum < 20
SQL>
SQL> c/20/100
  2* dba_extents where file_id=15 and block_id > 1170000) where rownum < 100
SQL> l
  1  select * from (select owner,segment_name,file_id,extent_id,block_id,blocks from
  2* dba_extents where file_id=15 and block_id > 1170000) where rownum < 100
SQL>

看来这个数据文件可能很大一部分根本没被使用;至少目前没有被使用。我们可以查看该数据文件的高水位信息来验证这一点:

SQL> select *
  2    from (select /*+ ordered use_hash(a,b,c) */
  3           a.file_id,
  4           a.file_name,
  5           a.filesize,
  6           b.freesize,
  7           (a.filesize - b.freesize) usedsize,
  8           c.hwmsize,
  9           c.hwmsize - (a.filesize - b.freesize) unsedsize_belowhwm,
 10           a.filesize - c.hwmsize canshrinksize
 11            from (select file_id,
 12                         file_name,
 13                         round(bytes / 1024 / 1024) filesize
 14                    from dba_data_files) a,
 15                 (select file_id, round(sum(dfs.bytes) / 1024 / 1024) freesize
 16                    from dba_free_space dfs
 17                   group by file_id) b,
 18                 (select file_id, round(max(block_id) * 8 / 1024) HWMsize
 19                    from dba_extents
 20                   group by file_id) c
 21           where a.file_id = b.file_id
 22             and a.file_id = c.file_id
 23           order by unsedsize_belowhwm desc)
 24   where file_id in (select file_id
 25                       from dba_data_files
 26                      where file_id=15)
 27   order by file_id;

   FILE_ID FILE_NAME                                                      FILESIZE   FREESIZE   USEDSIZE    HWMSIZE UNSEDSIZE_BELOWHWM CANSHRINKSIZE
---------- ------------------------------------------------------------ ---------- ---------- ---------- ---------- ------------------ -------------
        15 +DATA/billog/datafile/tbs_cen_ilog.279.907377523                  32724      31512       1212       1470                258         31254

 

不难看出目前该数据文件的HWM大概在1400M左右;而前面报错的坏块编号1185624*8192/1024/1024 大概是9400M左右;显然这是有问题的。

当然对于这个问题,我们可以其实不做任何处理,等业务表的数据逐渐增加;会自动去格式化处理掉这部分坏块,不过这样会影响备份。也就是说在坏块处理好之前备份可能都无法正常进行,除非指定allow corrption参数。

那么针对这种简单问题其实有3种比较简单的处理方法:

1、通过resize文件来完成

2、通过创建测试表,allocate extent到指定文件,并往表中插入大量数据,格式化坏块后,再drop table

3、通过bbed copy 空块直接覆盖即可(copy 覆盖需要修改rdba地址)。

 

很显然,针对这次的情况,resize文件是最简单的了。我们来看看效果:

SQL> alter database datafile 15 resize 9000m;

Database altered.

RMAN> validate datafile 15;

Starting validate at 2020-03-19 14:36:28
using channel ORA_DISK_1
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00015 name=+DATA/billog/datafile/tbs_cen_ilog.279.907377523
channel ORA_DISK_1: validation complete, elapsed time: 00:00:45
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
15   OK     0              0            1152000         14080921213487
  File Name: +DATA/billog/datafile/tbs_cen_ilog.279.907377523
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0
  Index      0              1149254
  Other      0              2746

非常简单的小问题。跟大家分享一下,就当是流水账吧!

 


评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注