Oracle Crash due to Controlfile sequence# reached 0xffffffff

近期某客户的一个Oracle数据库的controlfile sequence#居然达到了最大值(42亿 即 0xffffffff);导致数据库直接crash,然后再次启动时发现报错;

Sun Aug 08 08:21:58 2021
Thread 1 advanced to log sequence 662736 (LGWR switch)
  Current log# 5 seq# 662736 mem# 0: +DATADG/xxxx/onlinelog/group_5.401.961749619
Sun Aug 08 08:22:20 2021
*************************************************************************
ATTENTION: The controlfile sequence# has now reached 4294967295 (0xffffffff),
           which is the architectural limit. Further controlfile updates
           are no longer possible. To resume normal database operation
           it is necessary to shutdown abort all instances and perform
           the steps described in Doc ID 20324049.8 at My Oracle Support
           to reset the controlfile sequence# to 1.
*************************************************************************
Errors in file /u03/db/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_20578722.trc  (incident=843103):
ORA-00600: internal error code, arguments: [kccfhb_3], [4294967295], [], [], [], [], [], [], [], [], [], []
Incident details in: /u03/db/diag/rdbms/xxxx/xxxx1/incident/incdir_843103/xxxx1_ora_20578722_i843103.trc
Sun Aug 08 08:22:22 2021
Dumping diagnostic data in directory=[cdmp_20210808082222], requested by (instance=1, osid=20578722), summary=[incident=843103].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
......
ORA-1092 : opitsk aborting process
Sun Aug 08 08:22:41 2021
ORA-1092 : opitsk aborting process
Instance terminated by CKPT, pid = 23068730
Sun Aug 08 08:22:44 2021
......
Sun Aug 08 08:23:08 2021
ALTER DATABASE MOUNT /* db agent *//* {0:9:46000} */
This instance was first to mount
Sun Aug 08 08:23:08 2021
Reconfiguration complete
*************************************************************************
ATTENTION: The controlfile sequence# has now reached 4294967295 (0xffffffff),
           which is the architectural limit. Further controlfile updates
           are no longer possible. To resume normal database operation
           it is necessary to shutdown abort all instances and perform
           the steps described in Doc ID 20324049.8 at My Oracle Support
           to reset the controlfile sequence# to 1.
*************************************************************************
*************************************************************************
ATTENTION: The controlfile sequence# has now reached 4294967295 (0xffffffff),
           which is the architectural limit. Further controlfile updates
           are no longer possible. To resume normal database operation
           it is necessary to shutdown abort all instances and perform
           the steps described in Doc ID 20324049.8 at My Oracle Support
           to reset the controlfile sequence# to 1.
*************************************************************************
Errors in file /u03/db/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_51380254.trc  (incident=864827):
ORA-00600: internal error code, arguments: [kccfhb_3], [4294967295], [], [], [], [], [], [], [], [], [], []
Incident details in: /u03/db/diag/rdbms/xxxx/xxxx1/incident/incdir_864827/xxxx1_ora_51380254_i864827.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Aug 08 08:23:14 2021
Dumping diagnostic data in directory=[cdmp_20210808082314], requested by (instance=1, osid=51380254), summary=[incident=864827].
ORA-600 signalled during: ALTER DATABASE MOUNT /* db agent *//* {0:9:46000} */...
Sun Aug 08 08:23:15 2021
Shutting down instance (abort)
License high water mark = 2
USER (ospid: 49610866): terminating the instance
Instance terminated by USER, pid = 49610866

 

从上述Oracle alert log来看,Oracle提到了一个文档,查看该文档会发现,Oracle认为该问题是Bug导致,同时也提供了解决方案;其解决方案也不复杂,如下:

1. verify the controlfile sequence# is at or above 0xFF000000:

set numwidth 15

select max(FHCSQ)
from x$kcvfh;

2. Generate Trace file to recreate the controlfile:
    alter database backup controlfile to trace noresetlogs;
3. shutdown
4. startup nomount
5. alter session set events '20324049 trace name context forever, level 1';
6. execute the commands in the tracefile generated by step#2
7. alter session set events '20324049 trace name context off';
8. confirm the controlfile sequence# is now low with the same query in 1.

 

简单的讲,即先设置event 20324049;然后再重建controlfile,直接open即可;但是在进行重建controlfile时遇到了如下问题:

but not in the data dictionary. Deleting from controlfile.
Errors in file /u03/db/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_7930228.trc  (incident=936822):
ORA-00600: internal error code, arguments: [25016], [193], [52], [], [], [], [], [], [], [], [], []
Incident details in: /u03/db/diag/rdbms/xxxx/xxxx1/incident/incdir_936822/xxxx1_ora_7930228_i936822.trc
Sun Aug 08 13:43:50 2021
Dumping diagnostic data in directory=[cdmp_20210808134350], requested by (instance=1, osid=7930228), summary=[incident=936822].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u03/db/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_7930228.trc:
ORA-00600: internal error code, arguments: [25016], [193], [52], [], [], [], [], [], [], [], [], []
Errors in file /u03/db/diag/rdbms/xxxx/xxxx1/trace/xxxx1_ora_7930228.trc:
ORA-00600: internal error code, arguments: [25016], [193], [52], [], [], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 7930228): terminating the instance due to error 600
Instance terminated by USER, pid = 7930228
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (7930228) as a result of ORA-1092
Sun Aug 08 13:43:54 2021

上述错误提示其实是较为明显的,大概含义是指open数据库时发现有表空间信息丢失。不难看出该数据库在之前应该drop 过一些tablespace。在重建控制文件之后,打开数据库时,表空间的ts#必须是连续的,否则会抛出上述错误。那么怎么解决该问题呢?

我们有个大致的思路,先想办法把丢失的几条记录补充到ts$即可。这里我们通过dbx -a PID来进行操作,设置kokiasg 断点,insert 丢失的5条记录之后,再重启数据库即可。重启之后open数据库发现报ora-00600 [2662];该错误就非常简单了,这里不再多说。oradebug poke 推进scn即可。

最后再重启数据库,去重建一次controlfile,设置event 处理max sequence#的问题;在alert log中会看到如下类似信息:

CREATE CONTROLFILE is being performed with event 20324049 set, this will now reset the controlfile sequence number to 1 in the datafiles, the online redo logs, and the new controlfile.
datafile 1 header updated: controlfile sequence# reset to 1
NOTE: Loaded library: System
SUCCESS: diskgroup DATADG was mounted
datafile 2 header updated: controlfile sequence# reset to 1
datafile 3 header updated: controlfile sequence# reset to 1
datafile 4 header updated: controlfile sequence# reset to 1
datafile 5 header updated: controlfile sequence# reset to 1
datafile 6 header updated: controlfile sequence# reset to 1
datafile 7 header updated: controlfile sequence# reset to 1
datafile 8 header updated: controlfile sequence# reset to 1
datafile 9 header updated: controlfile sequence# reset to 1

最后打开数据库即可。至此该case也算是完结了。总的来讲,整个恢复过程不算太复杂。

因为比较少见,我本人也是第一次遇到将controlfile sequence#撑满的情况,比较有趣,所以简单记录一下,跟大家分享!


评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注