昨天同事遇到一个有些怪异的问题,使用了outlines,但是SQLID执行仍然出现执行计划错误的情况,
导致大量的read by other session等待,严重影响业务,最后通过revoke和grant来解决了。
下面的问题就是如何去找到当时的问题所在?换句话说,当时的那个sql语句为什么不走index scan而
选择低效的full table scan呢?
错误的执行计划如下:
Plan hash value: 1414349908
-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 100K(100)| | | |
| 1 | SORT ORDER BY | | 157K| 16M| 35M| 100K (2)| 00:20:03 | | |
| 2 | HASH UNIQUE | | 157K| 16M| 35M| 96389 (2)| 00:19:17 | | |
| 3 | HASH JOIN | | 157K| 16M| | 92578 (2)| 00:18:31 | | |
| 4 | TABLE ACCESS FULL | ACCT_ITEM_TYPE | 2817 | 73242 | | 15 (0)| 00:00:01 | | |
| 5 | HASH JOIN | | 157K| 12M| | 92561 (2)| 00:18:31 | | |
| 6 | TABLE ACCESS FULL | BILLING_CYCLE | 156 | 3900 | | 15 (0)| 00:00:01 | | |
| 7 | HASH JOIN | | 158K| 8647K| | 92544 (2)| 00:18:31 | | |
| 8 | TABLE ACCESS BY INDEX ROWID| A_PAY_OWE_STATE | 3 | 21 | | 2 (0)| 00:00:01 | | |
| 9 | INDEX RANGE SCAN | INDX_PAYMENT_METHOD_01 | 3 | | | 1 (0)| 00:00:01 | | |
| 10 | PARTITION HASH SINGLE | | 368K| 17M| | 92538 (2)| 00:18:31 | KEY | KEY |
| 11 | TABLE ACCESS FULL | ACCT_ITEM_OWE | 368K| 17M| | 92538 (2)| 00:18:31 | KEY | KEY |
-----------------------------------------------------------------------------------------------------------------------------------
下面是正确的执行计划:
Plan hash value: 2474252999
-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 198 (100)| | | |
| 1 | SORT ORDER BY | | 52 | 5304 | 198 (3)| 00:00:03 | | |
| 2 | HASH UNIQUE | | 52 | 5304 | 197 (2)| 00:00:03 | | |
| 3 | HASH JOIN | | 52 | 5304 | 196 (2)| 00:00:03 | | |
| 4 | HASH JOIN | | 52 | 4004 | 181 (2)| 00:00:03 | | |
| 5 | TABLE ACCESS BY INDEX ROWID | A_PAY_OWE_STATE | 3 | 21 | 2 (0)| 00:00:01 | | |
| 6 | INDEX RANGE SCAN | INDX_PAYMENT_METHOD_01 | 3 | | 1 (0)| 00:00:01 | | |
| 7 | MERGE JOIN | | 121 | 8470 | 178 (1)| 00:00:03 | | |
| 8 | TABLE ACCESS BY INDEX ROWID | BILLING_CYCLE | 70 | 1470 | 4 (0)| 00:00:01 | | |
| 9 | INDEX FULL SCAN | PK_BILLING_CYCLE | 157 | | 1 (0)| 00:00:01 | | |
| 10 | SORT JOIN | | 132 | 6468 | 174 (1)| 00:00:03 | | |
| 11 | TABLE ACCESS BY GLOBAL INDEX ROWID| ACCT_ITEM_OWE | 132 | 6468 | 173 (0)| 00:00:03 | ROW L | ROW L |
| 12 | INDEX RANGE SCAN | IDX_ACCT_ITEM_ACCT_ID | 132 | | 9 (0)| 00:00:01 | | |
| 13 | TABLE ACCESS FULL | ACCT_ITEM_TYPE | 2845 | 71125 | 15 (0)| 00:00:01 | | |
-----------------------------------------------------------------------------------------------------------------------------------
我们可以发现,区别就在是否使用了index IDX_ACCT_ITEM_ACCT_ID,进而导致整个执行计划差别非常之大,
还可以看到正常的情况下使用了merge join。
处理问题的思路是这样的:
1. 首先查看outline是否正常使用
SQL> select owner,name,used,version,ENABLED from DBA_OUTLINES
2 where owner='ACCT';
OWNER NAME USED VERSION ENABLED
--------- ------------------------------ ------------------- --------
ACCT SYS_OUTLINE_11091615175537003 USED 10.2.0.4.0 ENABLED
-- outline正常。
2. 检查outline定义是否包含index hint IDX_ACCT_ITEM_ACCT_ID.
SQL> select ol_name, hint#, hint_text
2 from OUTLN.ol$hints
3 order by ol_name, hint#;
OL_NAME HINT# HINT_TEXT
------------------------------ ----- ------------------------------------------------------------------------------------------
SYS_OUTLINE_11091615175537003 1 USE_HASH(@"SEL$5DA710D3" "D"@"SEL$1")
SYS_OUTLINE_11091615175537003 2 USE_HASH(@"SEL$5DA710D3" "A_PAY_OWE_STATE"@"SEL$2")
SYS_OUTLINE_11091615175537003 3 USE_MERGE(@"SEL$5DA710D3" "A"@"SEL$1")
SYS_OUTLINE_11091615175537003 4 LEADING(@"SEL$5DA710D3" "H"@"SEL$1" "A"@"SEL$1" "A_PAY_OWE_STATE"@"SEL$2" "D"@"SEL$1")
SYS_OUTLINE_11091615175537003 5 FULL(@"SEL$5DA710D3" "D"@"SEL$1")
SYS_OUTLINE_11091615175537003 6 INDEX_RS_ASC(@"SEL$5DA710D3" "A_PAY_OWE_STATE"@"SEL$2" ("A_PAY_OWE_STATE"."PAYMENT_METHOD"))
SYS_OUTLINE_11091615175537003 7 INDEX_RS_ASC(@"SEL$5DA710D3" "A"@"SEL$1" ("ACCT_ITEM_OWE"."ACCT_ID"))
SYS_OUTLINE_11091615175537003 8 INDEX(@"SEL$5DA710D3" "H"@"SEL$1" ("BILLING_CYCLE"."BILLING_CYCLE_ID"))
SYS_OUTLINE_11091615175537003 9 OUTLINE(@"SEL$2")
SYS_OUTLINE_11091615175537003 10 OUTLINE(@"SEL$1")
SYS_OUTLINE_11091615175537003 11 UNNEST(@"SEL$2")
SYS_OUTLINE_11091615175537003 12 OUTLINE_LEAF(@"SEL$5DA710D3")
SYS_OUTLINE_11091615175537003 13 ALL_ROWS
SYS_OUTLINE_11091615175537003 14 OPTIMIZER_FEATURES_ENABLE('10.2.0.4')
SYS_OUTLINE_11091615175537003 15 IGNORE_OPTIM_EMBEDDED_HINTS
15 rows selected.
从INDEX_RS_ASC(@"SEL$5DA710D3" "A"@"SEL$1" ("ACCT_ITEM_OWE"."ACCT_ID"))看,outline是有的。
3. 检查相关的对象在故障期间之前是否做过相应的ddl操作
SQL> select OWNER,OBJECT_NAME,SUBOBJECT_NAME,LAST_DDL_TIME,STATUS
2 from dba_objects
3 where OBJECT_NAME='ACCT_ITEM_OWE';
OWNER OBJECT_NAME SUBOBJECT_NAME LAST_DDL_TIM STATUS
-------- ---------------- ---------------------- ------------ -------
COMM ACCT_ITEM_OWE 02-MAR-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_0 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_1 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_2 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_3 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_4 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_5 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_6 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_7 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_8 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE P_ACCT_ITEM_OWE_9 28-FEB-09 VALID
ACCT ACCT_ITEM_OWE 19-SEP-11 VALID
12 rows selected.
SQL> select OWNER,OBJECT_NAME,SUBOBJECT_NAME,LAST_DDL_TIME,STATUS from
2 dba_objects where OBJECT_NAME='IDX_ACCT_ITEM_ACCT_ID';
OWNER OBJECT_NAME SUBOBJECT_NAME LAST_DDL_TIM STATUS
----------- ----------------------- ----------------- ------------ -------
ACCT IDX_ACCT_ITEM_ACCT_ID 28-FEB-09 VALID
检查发现, 9月18号并无DDL操作。
4. 查询该SQLID是否存在多个child number或存在多个执行计划
昨天下午查询v$sql已经找不到那个SQLID了,通过如下查询发现:
SQL> select sql_id,CHILD_NUMBER,SQL_TYPE_MISMATCH,OPTIMIZER_MISMATCH,OUTLINE_MISMATCH,BIND_MISMATCH,STATS_ROW_MISMATCH
2 from V$SQL_SHARED_CURSOR
3 where SQL_ID='&id';
Enter value for id: 4u9y643v2k8fw
old 2: from V$SQL_SHARED_CURSOR where SQL_ID='&id'
new 2: from V$SQL_SHARED_CURSOR where SQL_ID='4u9y643v2k8fw'
SQL_ID CHILD_NUMBER S O O B S
------------- ------------ - - - - -
4u9y643v2k8fw 0 N N N N N
4u9y643v2k8fw 1 N N Y N N
4u9y643v2k8fw 2 N N N N N
4u9y643v2k8fw 3 N N Y N N
4u9y643v2k8fw 4 N N Y N N
4u9y643v2k8fw 5 N N Y N N
4u9y643v2k8fw 6 N N N N N
4u9y643v2k8fw 7 N N N N N
4u9y643v2k8fw 8 N N N N N
4u9y643v2k8fw 9 N N N N N
4u9y643v2k8fw 10 N N N N N
4u9y643v2k8fw 11 N N N N N
4u9y643v2k8fw 12 N N N N N
4u9y643v2k8fw 13 N N N N N
4u9y643v2k8fw 14 N N N N N
4u9y643v2k8fw 15 N N N N N
4u9y643v2k8fw 16 N N N N N
4u9y643v2k8fw 17 N N N N N
4u9y643v2k8fw 18 N N N N N
4u9y643v2k8fw 19 N N N N N
4u9y643v2k8fw 20 N N N N N
21 rows selected.
备注:这里的第2个O是指OUTLINE_MISMATCH。
可以发现该sqlid出现了多个child cursor,另外检查发现参数cursor_sharing设置为simlair了。
该sqlid sql version肯定也较高。
这里有一点让我不解的是,通过DBA_HIST_*相关视图查询,发现当时出问题的时候的sqlid的child number
跟V$SQL_SHARED_CURSOR并不相同,当然这也不能完全说明问题,很可能sql已经被flush out多次了。
通过查hist视图,可以发现当时故障期间出现大量的read by other session等待:
EVENT COUNT(*) SESSION_ID SESSION SQL_ID SQL_CHILD_NUMBER SQL_PLAN_HASH_VALUE FORCE_MATCHING_SIGNATURE TIME_WAITED
----------------------- --------- ---------- ------- ------------- ---------------- ------------------- ------------------------ -----------
db file scattered read 1 1382 WAITING 4u9y643v2k8fw 18 1414349908 0 4502
read by other session 1 1398 WAITING 4u9y643v2k8fw 18 1414349908 0 8099
read by other session 1 1404 WAITING 4u9y643v2k8fw 18 1414349908 0 89
db file scattered read 1 1424 WAITING 4u9y643v2k8fw 18 1414349908 0 8483
read by other session 1 1633 WAITING 4u9y643v2k8fw 18 1414349908 0 10470
read by other session 1 2081 WAITING 4u9y643v2k8fw 18 1414349908 4.3893E+18 2047
read by other session 1 2239 WAITING 4u9y643v2k8fw 18 1414349908 0 5403
read by other session 1 685 WAITING 4u9y643v2k8fw 18 1414349908 0 11577
read by other session 1 1094 WAITING 4u9y643v2k8fw 18 1414349908 0 6340
read by other session 1 1168 WAITING 4u9y643v2k8fw 18 1414349908 4.3893E+18 1427
read by other session 1 1304 WAITING 4u9y643v2k8fw 18 1414349908 0 12656
5. 查看sqlid是否使用了outline
SQL> select SQL_ID,CHILD_NUMBER
2 from v$sql_plan
3 where sql_id='4u9y643v2k8fw';
SQL_ID CHILD_NUMBER
------------- ------------
4u9y643v2k8fw 12
4u9y643v2k8fw 12
.....
4u9y643v2k8fw 12
4u9y643v2k8fw 2
4u9y643v2k8fw 2
4u9y643v2k8fw 2
......
4u9y643v2k8fw 2
4u9y643v2k8fw 2
28 rows selected.
下午查询时只有number为2 12
SQL> SELECT info outline_name
2 FROM (SELECT ROWNUM r, EXTRACTVALUE (VALUE (d), '/info') info
3 FROM v$sql_plan p,
4 TABLE (XMLSEQUENCE (EXTRACT (XMLTYPE (p.other_xml),
5 '/other_xml/info'
6 )
7 )
8 ) d
9 WHERE other_xml IS NOT NULL
10 AND ID = 1
11 AND sql_id = '4u9y643v2k8fw'
12 AND child_number = 3)
13 WHERE r = 4;
no rows selected
SQL> SELECT info outline_name
2 FROM (SELECT ROWNUM r, EXTRACTVALUE (VALUE (d), '/info') info
3 FROM v$sql_plan p,
4 TABLE (XMLSEQUENCE (EXTRACT (XMLTYPE (p.other_xml),
5 '/other_xml/info'
6 )
7 )
8 ) d
9 WHERE other_xml IS NOT NULL
10 AND ID = 1
11 AND sql_id = '4u9y643v2k8fw'
12 AND child_number = 4)
13 WHERE r = 4;
no rows selected
SQL> SELECT info outline_name
2 FROM (SELECT ROWNUM r, EXTRACTVALUE (VALUE (d), '/info') info
3 FROM v$sql_plan p,
4 TABLE (XMLSEQUENCE (EXTRACT (XMLTYPE (p.other_xml),
5 '/other_xml/info'
6 )
7 )
8 ) d
9 WHERE other_xml IS NOT NULL
10 AND ID = 1
11 AND sql_id = '4u9y643v2k8fw'
12 AND child_number = 5)
13 WHERE r = 4;
no rows selected
SQL> SELECT info outline_name
2 FROM (SELECT ROWNUM r, EXTRACTVALUE (VALUE (d), '/info') info
3 FROM v$sql_plan p,
4 TABLE (XMLSEQUENCE (EXTRACT (XMLTYPE (p.other_xml),
5 '/other_xml/info'
6 )
7 )
8 ) d
9 WHERE other_xml IS NOT NULL
10 AND ID = 1
11 AND sql_id = '4u9y643v2k8fw'
12 AND child_number = 2)
13 WHERE r = 4;
OUTLINE_NAME
--------------------------------------------------------------------------------
"SYS_OUTLINE_11091615175537003"
SQL> SELECT info outline_name
2 FROM (SELECT ROWNUM r, EXTRACTVALUE (VALUE (d), '/info') info
3 FROM v$sql_plan p,
4 TABLE (XMLSEQUENCE (EXTRACT (XMLTYPE (p.other_xml),
5 '/other_xml/info'
6 )
7 )
8 ) d
9 WHERE other_xml IS NOT NULL
10 AND ID = 1
11 AND sql_id = '4u9y643v2k8fw'
12 AND child_number = 12)
13 WHERE r = 4;
OUTLINE_NAME
--------------------------------------------------------------------------------
"SYS_OUTLINE_11091615175537003"
可以看到,该sql的当前的2个child cursor都是使用了outline的。
但是这说明不了问题,既然该sql存在多个child cursor,那么必然存在多个plan hash value值。
这里我们需要去查询那个出问题的sqlid对应的plan_hash_value所对应的执行计划。
虽然说是有awr快照,但是通过这个方式已经查不到信息了。
select * from table(dbms_xplan.display_awr('&sql_id','&child_number',null,'advanced'));
我这里使用的如下的查询方式:
SQL> set long 1000000
SQL> set linesize 200
SQL> SELECT plan_hash_value, XMLTYPE (other_xml).EXTRACT ('/*')
2 FROM dba_hist_sql_plan
3 WHERE sql_id = '4u9y643v2k8fw'
4 AND plan_hash_value IN (1414349908)
5 AND DBMS_LOB.SUBSTR (other_xml, 1000) LIKE '%
10.2.0.4
1414349908
c10c
c103
313042
PLAN_HASH_VALUE
---------------
XMLTYPE(OTHER_XML).EXTRACT('/*')
--------------------------------------------------------------------------------------------------------------------------------
313044
c6481504081c43
PLAN_HASH_VALUE
---------------
XMLTYPE(OTHER_XML).EXTRACT('/*')
--------------------------------------------------------------------------------------------------------------------------------
PLAN_HASH_VALUE
---------------
XMLTYPE(OTHER_XML).EXTRACT('/*')
--------------------------------------------------------------------------------------------------------------------------------
从上面的查询来看,当时是使用了outline,但是问题出在这里:
也就是说正常情况下,这3条当中至少有1条是ACCT_ID的index rang scan才对。
今天上午,和同事沟通,说又出问题了,同事将参数cursor_sharing调为exact了,即默认值。
虽然这样会导致很多sql语句的硬解析,但是听他说上午调整以后到11点时,还没出现问题。
跟itpub版主棉花糖ONE交流了一下,他认为很可能是outline和cursor_sharing的bug。
但是我仍然觉得cursor_sharing可能不是真正的原因,因为ACCT_ID列是应用绑定变量列,
并不是通过参数cursor_sharing的设置来实现绑定变量的。
最后总结一下:
9i以后都不推荐再用outline了,感觉不太稳定,推荐使用sql profile来固定执行计划。
至于最后的结论到底是outline的bug还是cursor_sharing=similar的bug,现在还确定不了,只是可能性非常大。
发表回复