今天凌晨5点半,被客户的电话声惊喜。对于DBA来讲,最紧张的时刻莫过于半夜接到突然突然。若非紧急故障,客户通常不会大半夜打电话。从客户提供的信息来看,数据库不断crash:
Thu Aug 25 00:05:59 2022 Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x186D3B7, opiaba()+639] [flags: 0x0, count: 1] Errors in file /u02/app/oracle/diag/rdbms/gdfw/gdfw1/trace/gdfw1_ora_93981.trc (incident=590479): ORA-07445: exception encountered: core dump [opiaba()+639] [SIGSEGV] [ADDR:0x0] [PC:0x186D3B7] [SI_KERNEL(general_protection)] [] Incident details in: /u02/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_590479/xxxx1_ora_93981_i590479.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Thu Aug 25 00:06:17 2022 Dumping diagnostic data in directory=[cdmp_20220825000617], requested by (instance=1, osid=93981), summary=[incident=590479]. Thu Aug 25 00:06:19 2022 Sweep [inc][590479]: completed Sweep [inc2][590479]: completed Thu Aug 25 00:06:33 2022 Errors in file /u02/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pmon_94310.trc (incident=587614): ORA-00600: internal error code, arguments: [17147], [0x2223138B60], [], [], [], [], [], [], [], [], [], [] Incident details in: /u02/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_587614/xxxx1_pmon_94310_i587614.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Thu Aug 25 00:06:36 2022 Dumping diagnostic data in directory=[cdmp_20220825000636], requested by (instance=1, osid=94310 (PMON)), summary=[incident=587614]. Errors in file /u02/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pmon_94310.trc: ORA-00600: internal error code, arguments: [17147], [0x2223138B60], [], [], [], [], [], [], [], [], [], [] PMON (ospid: 94310): terminating the instance due to error 472 Thu Aug 25 00:06:37 2022 ORA-1092 : opitsk aborting process System state dump requested by (instance=1, osid=94310 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u02/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_diag_94335_20220825000641.trc Instance terminated by PMON, pid = 94310 Thu Aug 25 00:06:48 2022 Adjusting the default value of parameter parallel_max_servers from 3600 to 1970 due to the value of parameter processes (2000) Starting ORACLE instance (normal) ************************ Large Pages Information ******************* Per process system memlock (soft) limit = UNLIMITED
可以看到实例最终被pmon进程异常终止了。而ora-00600 17147 也是较为常见的。
进一步分析客户提供的trace 发现了如下内容,其中报错的存储过程传参超过7万个:
*** 2022-08-25 00:05:59.450 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0) ----- Current SQL Statement for this session (sql_id=dz3895jnbahxj) ----- BEGIN INSERT INTO PSSC.T_H_SG186FAULT_EVENT ( ID, CITY_UNIT_NAME, CITY_UNIT_ID, ........ SYS_GUID(), :1 , :2 , :3 , :4 , :5 , :6 , :7 , :78693 , :78694 , TO_DATE(:78695 ,'YYYY-MM-DD HH24:MI:SS'), :78696 , :78697 , TO_DATE(:78698 ,'YYYY-MM-DD HH24:MI:SS'), :78699 , :78700 , :78701 , :78702 , :78703 , :78704 , :78705 ); COMMIT; END; ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- skdstdst()+41 call kgdsdst() 000000000 ? 000000000 ? 7F046159BC40 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? ksedst1()+103 call skdstdst() 000000000 ? 000000000 ? 7F046159BC40 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? ksedst()+39 call ksedst1() 000000001 ? 000000001 ? 7F046159BC40 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? dbkedDefDump()+2746 call ksedst() 000000001 ? 000000001 ? 7F046159BC40 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? ksedmp()+41 call dbkedDefDump() 000000003 ? 000000003 ? 7F046159BC40 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? ssexhd()+2462 call ksedmp() 000000003 ? 000000003 ? 7F046159BC40 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? __sighandler() call ssexhd() 00000000B ? 7F04615A1BF0 ? 7F04615A1AE8 ? 7F046159BD18 ? 7F04615A07C0 ? 000000003 ? opiaba()+639 signal __sighandler() 000003380 ? 000003380 ? 000000000 ? 000003200 ? 222311EFD8 ? 0000000A1 ? opiprs()+307 call opiaba() 000003380 ? 2000000226B1F0F ? 000000000 ? 000003200 ? 222311EFD8 ? 0000000A1 ? kksParseChildCursor call opiprs() 7FFC39C77780 ? 7F045E3BC3E8 ? ()+572 00042860A ? 21CC3DE210 ? 000000004 ? 0000000A1 ? rpiswu2()+1776 call kksParseChildCursor 7FFC39C77960 ? 7F045E3BC3E8 ? () 00042860A ? 21CC3DE210 ? 000000004 ? 0000000A1 ? kksLoadChild()+5255 call rpiswu2() 2602464080 ? 000000027 ? 2263B6CC34 ? 000000017 ? 000000004 ? 0000000A1 ? kxsGetRuntimeLock() call kksLoadChild() 00C113DA0 ? 2736366FA8 ? +2209 7FFC39C78EB0 ? 000000017 ? 000000004 ? 0000000A1 ? kksfbc()+15108 call kxsGetRuntimeLock() 00C113DA0 ? 7F0461397608 ? 7FFC39C78EB0 ? 000000017 ? 000000004 ? 0000000A1 ? kkspsc0()+1171 call kksfbc() 7F0461397608 ? 000000003 ? 000000108 ? 7F045E3BC3E8 ? 000000004 ? 0000000A1 ? kksParseCursor()+11 call kkspsc0() 7F04613826D8 ? 7F045E3BC3E8 ? 6 00042860A ? 000000003 ? 000000006 ? 0000000A4 ? opiosq0()+2019 call kksParseCursor() 7FFC39C79A20 ? 7F045E3BC3E8 ? 00042860A ? 000000003 ? 000000006 ? 0000000A4 ? kpooprx()+274 call opiosq0() 000000003 ? 00000000E ? 7FFC39C79BF0 ? 0000000A4 ? 000000006 ? 0000000A4 ? kpoal8()+842 call kpooprx() 7FFC39C7D3D4 ? 7F045E3BC3E8 ? 000428609 ? 000000001 ? 000000000 ? 0000000A4 ? opiodr()+917 call kpoal8() 00000005E ? 7F045E3BC3E8 ? 000428609 ? 000000001 ? 000000000 ? 0000000A4 ? ttcpip()+2183 call opiodr() 00000005E ? 00000001C ? 7FFC39C7D3D0 ? 000000001 ? 000000000 ? 0000000A4 ? opitsk()+1710 call ttcpip() 00C1319F0 ? 0099D5D10 ? 7FFC39C7D3D0 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? opiino()+969 call opitsk() 00C1319F8 ? 000000000 ? 7FFC39C7D3D0 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? opiodr()+917 call opiino() 00000003C ? 000000004 ? 7FFC39C7EBC8 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? opidrv()+570 call opiodr() 00000003C ? 000000004 ? 7FFC39C7EBC8 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? sou2o()+103 call opidrv() 00000003C ? 000000004 ? 7FFC39C7EBC8 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? opimai_real()+133 call sou2o() 7FFC39C7EBA0 ? 00000003C ? 000000004 ? 7FFC39C7EBC8 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? ssthrdmain()+265 call opimai_real() 000000002 ? 7FFC39C7ED90 ? 000000004 ? 7FFC39C7EBC8 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? main()+201 call ssthrdmain() 000000002 ? 7FFC39C7ED90 ? 000000001 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? __libc_start_main() call main() 000000002 ? 7FFC39C7EF38 ? +253 000000001 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ? _start()+41 call __libc_start_main() 000A2A354 ? 000000002 ? 7FFC39C7EF28 ? 000000000 ? 7FFC39C7CE28 ? 7FFC39C7D3CC ?
对比上述相关错误以及call stack堆栈信息,发现为Oracle bug导致。可参考如下文档:
摘取其中一段描述:
CAUSE
Instance terminated due to ora-7445 [opiaba] which leads to ora-600 [17147]. ora-7445 [opiaba] error is reported due to the use of more than 65535 binds in the same sql / plsql statement.
You may find some or all of the following function codes in the ‘Call Stack’ portion of the trace file:
opiaba opiprs rpiswu2 kksLoadChild kxsGetRuntimeLock kksfbc
This scenario is reported in bug 13973845 which is closed of duplicated bug 12578873.
紧急下载该patch发客户,打上补丁之后,数据库实例不再频繁crash。从描述来看,该patch并不能彻底解决问题,只是让实例不crash而已。最根本的解决方法还是调整应用,将存储过程拆分,不要一次性传入数万个参数,超过oracle 65535的临界值。
这个问题很简答, 我们在其他客户之前都遇到过了多次,在我司内部mes平台也能看到相关的案例分享。简单记录一下吧!
发表回复