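Today a customer reported that the database instance behind one of their business systems had crashed and restarted. The alert log showed the following errors: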
Tue Oct 18 21:34:34 2022
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmon_2491932.trc (incident=512156):
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_2], [0x1108E6388], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_512156/xxxx1_lmon_2491932_i512156.trc
Tue Oct 18 21:34:44 2022
Dumping diagnostic data in directory=[cdmp_20221018213444], requested by (instance=1, osid=2491932 (LMON)), summary=[incident=512156].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmon_2491932.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_2], [0x1108E6388], [], [], [], [], [], [], [], [], [], []
LMON (ospid: 2491932): terminating the instance due to error 481
Tue Oct 18 21:34:44 2022
opiodr aborting process unknown ospid (7210544) as a result of ORA-1092
Tue Oct 18 21:34:44 2022
opiodr aborting process unknown ospid (5047352) as a result of ORA-1092
Tue Oct 18 21:34:44 2022
opiodr aborting process unknown ospid (10618076) as a result of ORA-1092
Tue Oct 18 21:34:44 2022
ORA-1092 : opitsk aborting process
Tue Oct 18 21:34:44 2022
ORA-1092 : opitsk aborting process
Tue Oct 18 21:34:44 2022
opiodr aborting process unknown ospid (5441064) as a result of ORA-1092
Tue Oct 18 21:34:44 2022
ORA-1092 : opitsk aborting process
Tue Oct 18 21:34:49 2022
Instance terminated by LMON, pid = 2491932
Tue Oct 18 21:34:53 2022
Starting ORACLE instance (normal)
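When an alert log is long, the line that names the terminating process can be located mechanically. The following is a minimal Python sketch, not an Oracle tool: the function name find_terminations and the regular expression are my own, and the pattern is derived only from the message format seen in this particular log, so it may need adjusting for other versions.

```python
import re

# Pattern modeled on the termination line in the alert log above
# (assumption: other releases may word this message differently).
TERMINATE_RE = re.compile(
    r"(?P<proc>\w+) \(ospid: (?P<ospid>\d+)\): "
    r"terminating the instance due to error (?P<error>\d+)"
)


def find_terminations(alert_text):
    """Return (process, ospid, error) for every termination message found."""
    return [
        (m["proc"], int(m["ospid"]), int(m["error"]))
        for m in TERMINATE_RE.finditer(alert_text)
    ]


# The line taken from the alert log above.
sample = "LMON (ospid: 2491932): terminating the instance due to error 481"
print(find_terminations(sample))
```

In practice the whole alert log text would be passed in instead of the single sample line.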
The instance was abnormally terminated by the LMON process. For the details we need to look at the LMON trace file:
*** SERVICE NAME:(SYS$BACKGROUND) 2022-10-18 21:34:34.668
*** MODULE NAME:() 2022-10-18 21:34:34.668
*** ACTION NAME:() 2022-10-18 21:34:34.668
Dump continued from file: /u01/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmon_2491932.trc
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_2], [0x1108E6388], [], [], [], [], [], [], [], [], [], []
========= Dump for incident 512156 (ORA 600 [kghstack_underflow_internal_2]) ========
*** 2022-10-18 21:34:34.691
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+40 bl 0000000109B4CD24 000000000 ? 000000001 ? 000000003 ? 000000000 ? 000000000 ? 000000001 ? 000000003 ? 000000000 ?
ksedst1()+112 call skdstdst() 171F2D30C8558AB1 ? 4844284100000000 ? FFFFFFFFFFF6500 ? 28E4DEBE4CBF3 ? 10A81AD8C ? 000000000 ? 11072A8C0 ? 2050033FFFF6508 ?
ksedst()+40 call ksedst1() 000000000 ? 00000000A ? 000003000 ? 10A5BFFA8 ? 000000000 ? 000000000 ? 000002004 ? 000000001 ?
dbkedDefDump()+1516 call ksedst() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 300000003 ?
ksedmp()+72 call dbkedDefDump() 31072A8C0 ? 110000A60 ? FFFFFFFFFFF6D10 ? 1106AC1B8 ? 100125838 ? FFFFFFFFFFF7730 ? 1000F0D94 ? 1106AC1B8 ?
ksfdmp()+100 call ksedmp() 000000002 ? 000000000 ? 000000002 ? 10AAE5CB0 ? 10A07CFD0 ? 000000000 ? 1109D3E30 ? 11072A8C0 ?
dbgexPhaseII()+1904 call ksfdmp() 000000000 ? 00000000A ? 000000002 ? 000000000 ? 000000002 ? 10A07CFC8 ? 000000000 ? 001050005 ?
dbgexProcessError()+1556 call dbgexPhaseII() 11072A8C0 ? 1109D2040 ? 00007D09C ? 200000000 ? FFFFFFFFFFF7C28 ? 000000082 ? 000000000 ? 000000000 ?
dbgeExecuteForError()+72 call dbgexProcessError() 11072A8C0 ? 1109D3E30 ? 1FFFFB6A0 ? 000000001 ? 000000703 ? 000000011 ? 000000006 ? 1109D5B78 ?
dbgePostErrorKGE()+2044 call dbgeExecuteForError() 000000000 ? 00A4D1050 ? FFFFFFFFFFFFB210 ? 00A4D1050 ? 000000000 ? 90000000D6969D8 ? 000000000 ? 110000C58 ?
dbkePostKGE_kgsf()+68 call dbgePostErrorKGE() 000003000 ? 10A5BFFA8 ? 25800000002 ? 109E85570 ? 000000000 ? 000000000 ? FFFFFFFFFFFBEE0 ? 11113A600 ?
kgeadse()+380 call dbkePostKGE_kgsf() 102DA1484 ? 100000000 ? FFFFFFFFFFFC0D8 ? 000000000 ? 110AED1A0 ? 1108EA610 ? 000000002 ? 700000000013680 ?
kgerinv_internal()+48 call kgeadse() 000000000 ? 000000000 ? 000000000 ? 1700000010 ? 100000000 ? 000003000 ? 110D33350 ? 1108EA610 ?
kgerinv()+48 call kgerinv_internal() 8311AABF3BAF ? 8311AABF3FD4 ? 8311AABF3BAF ? 8311AABF3BAF ? 000000000 ? 10A5A3090 ? 000000000 ? 000000000 ?
kgeasnmierr()+72 call kgerinv() 000000000 ? 000000023 ? 000000001 ? 000000004 ? 000000000 ? 000000001 ? 110D33350 ? 110AED398 ?
kghstack_underflow_internal()+280 call kgeasnmierr() 000000000 ? FFFFFFFFFFFC100 ? 00000001E ? 100000001 ? 000000002 ? 1108E6388 ? 000000000 ? 000000000 ?
kghstack_free()+716 call kghstack_underflow_internal() 000000001 ? 08DBD1E85 ? 700011351BB7B48 ? 0000F4240 ? 000000000 ? 00000000A ? 000003000 ? 10A5BFFA8 ?
kccgrd()+264 call kghstack_free() FFFFFFFFFFFC0C0 ? 4224282B00000000 ? 103D2C888 ? 000004000 ? 500000005 ? C0000000C ? 400003000 ? 10A5BFFA8 ?
kjxgrf_rr_read()+660 call kccgrd() 1FFFD02FAFF35E5 ? 110A5BD70 ? FFFFFFFFFFFC180 ? 000000000 ? 110A5BD70 ? 110FBCF48 ? 0037D6E50 ? 1106AC1B8 ?
kjxgrDD_rr_read()+104 call kjxgrf_rr_read() 110A032D0 ? 700011342677E98 ? 000000000 ? 000000001 ? FFFFFFFFFFFC6A4 ? 110A03B38 ? FFFFFFFFFFFC630 ? 42245280FFFFC790 ?
kjxgrimember()+124 call kjxgrDD_rr_read() 000003000 ? 10A5BFFA8 ? 000000002 ? 700000000013680 ? 11011EAD0 ? FFFFFFFFFFFCD80 ? 000000001 ? 218DBD1E85 ?
kjxggpoll()+804 call kjxgrimember() FFFFFFFFFFFC6D0 ? 0000186A0 ? 101FECE90 ? 8311AABD8A40 ? 70000000000C0D0 ? 000000000 ? 000001568 ? 100000000 ?
kjfmact()+508 call kjxggpoll() 000000000 ? 000000000 ? 000000000 ? 000000000 ? FFFFFFFFFFFC7A0 ? 000000000 ? 1037BB124 ? 000000000 ?
kjfdact()+32 call kjfmact() 11011EAD0 ? FFFFFFFFFFFCD80 ? 000000001 ? 000000000 ? 002050000 ? 001160000 ? 10896F91E ? 14616E27FFFFC930 ?
kjfcln()+2240 call kjfdact() 000000000 ? 10A5B3014 ? 700011351BB7B48 ? 000000002 ? 700011351BB7B54 ? 000000004 ? 2FFFFF570 ? 200000002 ?
ksbrdp()+2216 call kjfcln() 700000000013198 ? 7000000000131B4 ? 048245028 ? 000000E00 ? 1108B2310 ? 100638128 ? 000000001 ? 700000007 ?
opirip()+1620 call ksbrdp() FFFFFFFFFFFFEA7 ? 10B2ADCF0 ? FFFFFFFFFFFDE50 ? 000000000 ? 000000001 ? 000000000 ? 01099067F ? 000000001 ?
opidrv()+608 call opirip() 10AE1FAC0 ? 410134198 ? FFFFFFFFFFFEFC0 ? 2F7530312F ? 108354684 ? 1106AC1B8 ? 7264626D732F6462 ? 1106AC1B8 ?
sou2o()+136 call opidrv() 32067E1DB0 ? 4FFFFF388 ? FFFFFFFFFFFEFC0 ? 25001D022C0000 ? 000000010 ? 1106AC1B8 ? 000000000 ? 000000000 ?
opimai_real()+188 call sou2o() FFFFFFFFFFFF030 ? 5524445B00000001 ? 9000000000DC64C ? BADC0FFEE0DDF00D ? 000000003 ? 9001000A008DB98 ? A0000000A000000 ? 10B6B6F40 ?
ssthrdmain()+276 call opimai_real() FFFFFFFFFFFF110 ? 9001000A0092DC0 ? FFFFFFFFFFFF130 ? 10B6F72B8 ? 90000000008AB0C ? 9001000A008DB98 ? FFFFFFFFFFFF110 ? 9001000A008DB98 ?
main()+204 call ssthrdmain() 3F0003720 ? FFFFFFFFFFFF478 ? FFFFFFFFFFFF4E0 ? 9FFFFFFF000D6F0 ? 9FFFFFFF00009E0 ? 000000000 ? 000000000 ? 9FFFFFFF000D6F0 ?
__start()+112 call main() 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ?
From the call stack above it is easy to identify the bug below; the details are in the following MOS article:
SYMPTOMS
- The LMON or LMS process crash the instance with an error like:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_2], [0x110A10838], [], [], [], [], [], [], [], [], [], []
ORA-1092 : opitsk aborting process
Instance terminated by LMS1, pid = 14024818
- Review of the generated tracefiles reveals a call stack similar to:
… kghstack_underflow_internal kghstack_free kccgrd kjxgrf_rr_read kjxgrDD_rr_read kjxgrimember kjxggpoll kjfmact kjfdact kjfcln ksbrdp …
– OR –
… kghstack_underflow_internal kghstack_free ktundo kturcrbackoutonechg ktrgcm ktrget3 ktrget2 kclgcr …
CHANGES
CAUSE
The cause of this problem has been identified in a.o.:
Bug 18687067 – ORA-600 [KGHSTACK_UNDERFLOW_INTERNAL_2]
closed as duplicate of Bug 20675347 – ORA-07445 [KGHSTACK_OVERFLOW_INTERNAL()+644]
The bug is caused by an AIX compiler issue causing volatile variables in the Oracle kernel not to be handled properly.
The bug is a regression introduced in 11.2.0.4.
The issue does not reproduce in later versions, i.e. 12.1.
SOLUTION
To solve the issue, use any of below alternatives:
- Upgrade to 12.1
– OR –
- Apply interim patch 20675347, if available for your platform and Oracle version.
To check for conflicting patches, please use the MOS Patch Planner Tool
Please refer to
Note 1317012.1 – How To Use MOS Patch Planner To Check And Request The Conflict Patches?
If no patch exists for your version, please contact Oracle Support for a backport request.
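The match between the LMON trace and the first call stack listed in the note can be checked mechanically rather than by eye. The Python below is only an illustrative sketch: frame_names, contains_signature, and the regular expression are my own helpers, not part of any Oracle tooling, and they assume the trace has one frame per line with any wrapped function names already rejoined.

```python
import re

# Matches "function()+offset" at the start of a call-stack line.
FRAME_RE = re.compile(r"^\s*([A-Za-z_]\w*)\(\)\+\d+")


def frame_names(trace_lines):
    """Extract the calling-location function name from each stack line."""
    names = []
    for line in trace_lines:
        m = FRAME_RE.match(line)
        if m:
            names.append(m.group(1))
    return names


def contains_signature(names, signature):
    """True if `signature` occurs as a contiguous run inside `names`."""
    n = len(signature)
    return any(names[i:i + n] == signature for i in range(len(names) - n + 1))


# First call stack quoted in the MOS note.
MOS_SIGNATURE = (
    "kghstack_underflow_internal kghstack_free kccgrd kjxgrf_rr_read "
    "kjxgrDD_rr_read kjxgrimember kjxggpoll kjfmact kjfdact kjfcln ksbrdp"
).split()

# Frames from the LMON trace above, argument values omitted for brevity.
trace = [
    "kgeasnmierr()+72 call kgerinv()",
    "kghstack_underflow_internal()+280 call kgeasnmierr()",
    "kghstack_free()+716 call kghstack_underflow_internal()",
    "kccgrd()+264 call kghstack_free()",
    "kjxgrf_rr_read()+660 call kccgrd()",
    "kjxgrDD_rr_read()+104 call kjxgrf_rr_read()",
    "kjxgrimember()+124 call kjxgrDD_rr_read()",
    "kjxggpoll()+804 call kjxgrimember()",
    "kjfmact()+508 call kjxggpoll()",
    "kjfdact()+32 call kjfmact()",
    "kjfcln()+2240 call kjfdact()",
    "ksbrdp()+2216 call kjfcln()",
    "opirip()+1620 call ksbrdp()",
]

print(contains_signature(frame_names(trace), MOS_SIGNATURE))
```

A contiguous match is deliberately strict; a looser subsequence match might be preferable when comparing against notes that elide frames.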
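According to the note, this problem is fairly common on 11.2.0.4; it occurred here mainly because the customer had not installed the corresponding PSU. The issue is relatively simple, so I am just recording it briefly for future reference.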