绑定变量超过65535 导致数据库集群频繁crash

今天凌晨5点半,被客户的电话声惊喜。对于DBA来讲,最紧张的时刻莫过于半夜接到突然突然。若非紧急故障,客户通常不会大半夜打电话。从客户提供的信息来看,数据库不断crash:

Thu Aug 25 00:05:59 2022
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x186D3B7, opiaba()+639] [flags: 0x0, count: 1]
Errors in file /u02/app/oracle/diag/rdbms/gdfw/gdfw1/trace/gdfw1_ora_93981.trc  (incident=590479):
ORA-07445: exception encountered: core dump [opiaba()+639] [SIGSEGV] [ADDR:0x0] [PC:0x186D3B7] [SI_KERNEL(general_protection)] []
Incident details in: /u02/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_590479/xxxx1_ora_93981_i590479.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Aug 25 00:06:17 2022
Dumping diagnostic data in directory=[cdmp_20220825000617], requested by (instance=1, osid=93981), summary=[incident=590479].
Thu Aug 25 00:06:19 2022
Sweep [inc][590479]: completed
Sweep [inc2][590479]: completed
Thu Aug 25 00:06:33 2022
Errors in file /u02/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pmon_94310.trc  (incident=587614):
ORA-00600: internal error code, arguments: [17147], [0x2223138B60], [], [], [], [], [], [], [], [], [], []
Incident details in: /u02/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_587614/xxxx1_pmon_94310_i587614.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Aug 25 00:06:36 2022
Dumping diagnostic data in directory=[cdmp_20220825000636], requested by (instance=1, osid=94310 (PMON)), summary=[incident=587614].
Errors in file /u02/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_pmon_94310.trc:
ORA-00600: internal error code, arguments: [17147], [0x2223138B60], [], [], [], [], [], [], [], [], [], []
PMON (ospid: 94310): terminating the instance due to error 472
Thu Aug 25 00:06:37 2022
ORA-1092 : opitsk aborting process
System state dump requested by (instance=1, osid=94310 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u02/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_diag_94335_20220825000641.trc
Instance terminated by PMON, pid = 94310
Thu Aug 25 00:06:48 2022
Adjusting the default value of parameter parallel_max_servers
from 3600 to 1970 due to the value of parameter processes (2000)
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED

可以看到实例最终被pmon进程异常终止了。而ora-00600 17147 也是较为常见的。

进一步分析客户提供的trace 发现了如下内容,其中报错的存储过程传参超过7万个:

*** 2022-08-25 00:05:59.450
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=dz3895jnbahxj) -----
BEGIN
INSERT INTO  PSSC.T_H_SG186FAULT_EVENT
(
ID,
CITY_UNIT_NAME,
CITY_UNIT_ID,
........
SYS_GUID(),
:1 ,
:2 ,
:3 ,
:4 ,
:5 ,
:6 ,
:7 ,
:78693 ,
:78694 ,
TO_DATE(:78695 ,'YYYY-MM-DD HH24:MI:SS'),
:78696 ,
:78697 ,
TO_DATE(:78698 ,'YYYY-MM-DD HH24:MI:SS'),
:78699 ,
:78700 ,
:78701 ,
:78702 ,
:78703 ,
:78704 ,
:78705
);
COMMIT;
END;
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+41        call     kgdsdst()            000000000 ? 000000000 ?
7F046159BC40 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
ksedst1()+103        call     skdstdst()           000000000 ? 000000000 ?
7F046159BC40 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
ksedst()+39          call     ksedst1()            000000001 ? 000000001 ?
7F046159BC40 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
dbkedDefDump()+2746  call     ksedst()             000000001 ? 000000001 ?
7F046159BC40 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
ksedmp()+41          call     dbkedDefDump()       000000003 ? 000000003 ?
7F046159BC40 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
ssexhd()+2462        call     ksedmp()             000000003 ? 000000003 ?
7F046159BC40 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
__sighandler()       call     ssexhd()             00000000B ? 7F04615A1BF0 ?
7F04615A1AE8 ? 7F046159BD18 ?
7F04615A07C0 ? 000000003 ?
opiaba()+639         signal   __sighandler()       000003380 ? 000003380 ?
000000000 ? 000003200 ?
222311EFD8 ? 0000000A1 ?
opiprs()+307         call     opiaba()             000003380 ? 2000000226B1F0F ?
000000000 ? 000003200 ?
222311EFD8 ? 0000000A1 ?
kksParseChildCursor  call     opiprs()             7FFC39C77780 ? 7F045E3BC3E8 ?
()+572                                             00042860A ? 21CC3DE210 ?
000000004 ? 0000000A1 ?
rpiswu2()+1776       call     kksParseChildCursor  7FFC39C77960 ? 7F045E3BC3E8 ?
()                   00042860A ? 21CC3DE210 ?
000000004 ? 0000000A1 ?
kksLoadChild()+5255  call     rpiswu2()            2602464080 ? 000000027 ?
2263B6CC34 ? 000000017 ?
000000004 ? 0000000A1 ?
kxsGetRuntimeLock()  call     kksLoadChild()       00C113DA0 ? 2736366FA8 ?
+2209                                              7FFC39C78EB0 ? 000000017 ?
000000004 ? 0000000A1 ?
kksfbc()+15108       call     kxsGetRuntimeLock()  00C113DA0 ? 7F0461397608 ?
7FFC39C78EB0 ? 000000017 ?
000000004 ? 0000000A1 ?
kkspsc0()+1171       call     kksfbc()             7F0461397608 ? 000000003 ?
000000108 ? 7F045E3BC3E8 ?
000000004 ? 0000000A1 ?
kksParseCursor()+11  call     kkspsc0()            7F04613826D8 ? 7F045E3BC3E8 ?
6                                                  00042860A ? 000000003 ?
000000006 ? 0000000A4 ?
opiosq0()+2019       call     kksParseCursor()     7FFC39C79A20 ? 7F045E3BC3E8 ?
00042860A ? 000000003 ?
000000006 ? 0000000A4 ?
kpooprx()+274        call     opiosq0()            000000003 ? 00000000E ?
7FFC39C79BF0 ? 0000000A4 ?
000000006 ? 0000000A4 ?
kpoal8()+842         call     kpooprx()            7FFC39C7D3D4 ? 7F045E3BC3E8 ?
000428609 ? 000000001 ?
000000000 ? 0000000A4 ?
opiodr()+917         call     kpoal8()             00000005E ? 7F045E3BC3E8 ?
000428609 ? 000000001 ?
000000000 ? 0000000A4 ?
ttcpip()+2183        call     opiodr()             00000005E ? 00000001C ?
7FFC39C7D3D0 ? 000000001 ?
000000000 ? 0000000A4 ?
opitsk()+1710        call     ttcpip()             00C1319F0 ? 0099D5D10 ?
7FFC39C7D3D0 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
opiino()+969         call     opitsk()             00C1319F8 ? 000000000 ?
7FFC39C7D3D0 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
opiodr()+917         call     opiino()             00000003C ? 000000004 ?
7FFC39C7EBC8 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
opidrv()+570         call     opiodr()             00000003C ? 000000004 ?
7FFC39C7EBC8 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
sou2o()+103          call     opidrv()             00000003C ? 000000004 ?
7FFC39C7EBC8 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
opimai_real()+133    call     sou2o()              7FFC39C7EBA0 ? 00000003C ?
000000004 ? 7FFC39C7EBC8 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
ssthrdmain()+265     call     opimai_real()        000000002 ? 7FFC39C7ED90 ?
000000004 ? 7FFC39C7EBC8 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
main()+201           call     ssthrdmain()         000000002 ? 7FFC39C7ED90 ?
000000001 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
__libc_start_main()  call     main()               000000002 ? 7FFC39C7EF38 ?
+253                                               000000001 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?
_start()+41          call     __libc_start_main()  000A2A354 ? 000000002 ?
7FFC39C7EF28 ? 000000000 ?
7FFC39C7CE28 ? 7FFC39C7D3CC ?

对比上述相关错误以及call stack堆栈信息,发现为Oracle bug导致。可参考如下文档:

摘取其中一段描述:

CAUSE

Instance terminated due to ora-7445 [opiaba] which leads to ora-600 [17147]. ora-7445 [opiaba] error is reported due to the use of more than 65535 binds in the same sql / plsql statement.

You may find some or all of the following function codes in the ‘Call Stack’ portion of the trace file:

opiaba opiprs rpiswu2 kksLoadChild kxsGetRuntimeLock kksfbc

This scenario is reported in bug 13973845 which is closed of duplicated bug 12578873.

紧急下载该patch发客户,打上补丁之后,数据库实例不再频繁crash。从描述来看,该patch并不能彻底解决问题,只是让实例不crash而已。最根本的解决方法还是调整应用,将存储过程拆分,不要一次性传入数万个参数,超过oracle 65535的临界值。

这个问题很简答, 我们在其他客户之前都遇到过了多次,在我司内部mes平台也能看到相关的案例分享。简单记录一下吧!


评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注