On Friday, July 13th, 2012, I was the on-call Oracle DBA and got to work on a high-priority client request. A member of our client's financial team was executing the final steps of a year-end closing process in their ERP production environment and had a problem. The step that I received a complaint about had been running for 2 hours and still hadn't completed. I was told that last year the same step completed in 20 minutes. Looking back, I think it may be beneficial for some of you to read how the issue was troubleshot and resolved. It may give you a chance to learn something, or to share some better diagnostic techniques with me and others. Please do not hesitate to use the comments section of this blog post.
Pressure
I should mention that the whole troubleshooting story, from the first call to the final resolution, took about 3 hours. I bet I spent 2 of those 3 hours on communications. There were four different people (including two managers) I had to talk to and keep informed of the investigation's progress. One of the managers was on a conference call with several other senior people from our client's financial team. At some point, that manager was asking for a status update at 2-5 minute intervals. By now you should have an idea of the amount of pressure I was under while working on the incident. This was one of those cases where it doesn't matter how technically correct/beautiful/precise your actions or solutions are. The only thing that matters is how quickly you resolve the issue.
Initial Call
It all started with a page from the client. A person asked me to call him to discuss an issue. From the call, I learned the following:
- There is a batch job (EBS – Concurrent Request) executing in the production environment.
- The last time the job executed successfully was a year ago (read = no execution history, should start the troubleshooting from scratch).
- It previously took 20 minutes to execute.
- At the time, it had been running 2+ hours and still didn’t complete.
- As usual in such cases, I was asked to first check if the job wasn’t blocked by anything, if there was enough temp space, etc.
I promised to have an initial look and get back to the person with the case assessment. It wasn’t too difficult to find a database session associated with the Concurrent Request. (I used the run_cr.sql script from my personal SQL library.)
Troubleshooting
The next thing I typically check in such cases is SEQ# and some other fields from the V$SESSION view, to get an initial idea of what we are dealing with. I got the following initial information:
SQL> select SEQ#, SQL_ID, PREV_SQL_ID, MODULE, LAST_CALL_ET, EVENT, P1,
  2         WAIT_TIME, SECONDS_IN_WAIT, STATE, SQL_TRACE
  3  from v$session where sid=2542;

           SEQ# SQL_ID        PREV_SQL_ID   MODULE         LAST_CALL_ET
--------------- ------------- ------------- -------------- ---------------
EVENT                                     P1       WAIT_TIME SECONDS_IN_WAIT STATE               SQL_TRAC
---------------------------- --------------- --------------- --------------- ------------------- --------
           3814               48tu5cy75d7z4 UPGLLEZL                    9106
latch: cache buffers chains      17408691056              -1               5 WAITED SHORT TIME   DISABLED

1 row selected.

SQL> /

           SEQ# SQL_ID        PREV_SQL_ID   MODULE         LAST_CALL_ET
--------------- ------------- ------------- -------------- ---------------
EVENT                                     P1       WAIT_TIME SECONDS_IN_WAIT STATE               SQL_TRAC
---------------------------- --------------- --------------- --------------- ------------------- --------
           3814               48tu5cy75d7z4 UPGLLEZL                    9203
latch: cache buffers chains      17408691056              -1             102 WAITED SHORT TIME   DISABLED

1 row selected.
Notes: No Diagnostic Pack license was available in this environment. During the research, I did a quick check against the “Beginning Performance Tuning” article by Arup Nanda (Twitter, Blog) to validate my understanding of the V$SESSION view's fields.
We can learn quite a bit from the two snapshots above. Some of you will notice that there was a 97-second interval between the two V$SESSION snapshots (LAST_CALL_ET 9203 – 9106). The session had been executing a single call for 2 hours and 33 minutes (LAST_CALL_ET = 9203 seconds). It was working (running on CPU) rather than waiting on anything, and it didn't execute any other SQL between the two snapshots (SEQ# stayed the same). However, there was one unusual thing in this output: no value was reported in the SQL_ID field (the currently executing SQL). What was the session executing? Was it executing a SQL statement at all? And if it was, which one?
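As a quick sanity check, the arithmetic above can be reproduced with a minimal Python sketch, using only the LAST_CALL_ET values from the two V$SESSION snapshots:

```python
# LAST_CALL_ET reports the number of seconds since the current call started.
last_call_et_1 = 9106  # first V$SESSION snapshot
last_call_et_2 = 9203  # second snapshot

# Interval between the two snapshots
interval = last_call_et_2 - last_call_et_1
print(interval)  # 97 seconds

# Total elapsed time of the current call at the second snapshot
hours, rem = divmod(last_call_et_2, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours}h {minutes}m {seconds}s")  # 2h 33m 23s
```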
How to find the current SQL if SQL_ID is null
The first idea I tried was switching on the 10046 event (SQL trace) to see if I could figure something out from there. This is what I found in the trace file:
*** 2012-07-12 20:39:49.065
WAIT #47325300207976: nam='db file sequential read' ela= 9835 file#=156 block#=930982 blocks=1 obj#=393791 tim=1342139989065730
WAIT #47325300207976: nam='db file sequential read' ela= 621 file#=156 block#=930983 blocks=1 obj#=393791 tim=1342139989066806
WAIT #47325300207976: nam='db file sequential read' ela= 24 file#=156 block#=930984 blocks=1 obj#=393791 tim=1342139989067076
*** 2012-07-12 20:40:30.282
WAIT #47325300207976: nam='latch: cache buffers chains' ela= 84 address=17411387456 number=155 tries=0 obj#=393791 tim=1342140030282022
*** 2012-07-12 20:41:32.846
WAIT #47325300207976: nam='db file sequential read' ela= 6258 file#=156 block#=930869 blocks=1 obj#=393791 tim=1342140092846169
*** 2012-07-12 20:44:50.771
WAIT #47325300207976: nam='db file sequential read' ela= 6512 file#=146 block#=1023949 blocks=1 obj#=393791 tim=1342140290771848
*** 2012-07-12 20:44:51.785
WAIT #47325300207976: nam='db file sequential read' ela= 5782 file#=156 block#=931047 blocks=1 obj#=393791 tim=1342140291785475
*** 2012-07-12 20:45:49.957
WAIT #47325300207976: nam='db file sequential read' ela= 6098 file#=156 block#=930851 blocks=1 obj#=393791 tim=1342140349957004
WAIT #47325300207976: nam='db file sequential read' ela= 3614 file#=156 block#=930853 blocks=1 obj#=393791 tim=1342140349961811
As you can see, it didn't give me any clue about what SQL the session was executing. Let me ask you: what bits of useful information can we get from the output above? First of all, the wait events were relatively rare and short (see the tim and ela values). Any single wait took a tiny fraction of time (max ela = 9835 µs ≈ 10 ms). Almost all of the db file sequential read events were against the same file#=156. Using the SQL below, I found that file#=156 was a datafile from an index tablespace, and that the obj#=393791 parameter (it was an 11.2.0.2 database) pointed to one of the DB objects the session was working with.
SQL> select name from v$datafile where file#=156;

NAME
--------------------------------------------------------------------------------
/u01/.path./APPS_TS_TX_IDX19.dbf

SQL> select owner, object_name, object_type from dba_objects where object_id=393791;

OWNER
------------------------------
OBJECT_NAME
--------------------------------------------------------------------------------
OBJECT_TYPE
-------------------
GL
GL_CODE_COMBINATIONS_N7
INDEX
At this stage, everything indicated that the session was probably running a single SQL statement: it was using a lot of CPU and, from time to time, read an index's blocks from disk. However, for one reason or another, Oracle failed to show the current SQL in the V$SESSION view.
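The conclusion that the session was mostly burning CPU follows directly from the trace arithmetic: ela and tim in a 10046 trace are reported in microseconds. A small Python sketch, using only the first and last tim values and the longest ela from the trace excerpt above:

```python
# 'tim' and 'ela' in a 10046 trace are in microseconds (10g and later).
first_tim = 1342139989065730   # first WAIT line in the excerpt
last_tim  = 1342140349961811   # last WAIT line in the excerpt

window_s = (last_tim - first_tim) / 1_000_000
print(round(window_s))  # ~361 s of runtime covered by only a handful of waits

max_ela_us = 9835              # longest single wait (db file sequential read)
print(max_ela_us / 1000)       # 9.835 ms: even the worst wait was tiny,
                               # so the session spent almost all of its time on CPU
```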
I was running out of ideas. (Keep in mind that I was under pressure.) What else could I quickly try without any additional research? The client was demanding results, so I decided to use the “Phone A Friend” option (c) “Who Wants to Be a Millionaire?” :) At Pythian, we have a very good option to escalate a problem to the “sword” team's representatives in case of emergencies. This time, I called Gleb Otochkin (Twitter, Blog). Gleb is a Cluster Technical Lead at Pythian (one position below STO). He is very knowledgeable and easy to talk to. Just a 3-minute conversation was enough to generate a few good ideas.
Have a look at the following output:
SQL> set lines 180 pages 1000 numwidth 15
SQL> select a.sql_id, a.executions, a.CPU_TIME, a.DISK_READS, a.BUFFER_GETS
  2  from V$SQLAREA a
  3  where a.sql_id in (select o.sql_id from V$OPEN_CURSOR o where o.sid=2542)
  4  order by a.BUFFER_GETS;

SQL_ID          EXECUTIONS        CPU_TIME   DISK_READS    BUFFER_GETS
------------- ------------ --------------- ------------ --------------
3sypbzc2t2pg8          338            9998            0              0
...
48tu5cy75d7z4           28         4489318        30139         420631
ggvskj3ku51s7        59872         4119356           86         538879
a9811wy55zu3u       170601         6818972         1731         703579
fmfdkztk8vx2        502530        20736841         3458        1926394
2tqd4wxg82v80      1921256       519361033         1352      125590226
8q6xn4yqwpr2u            1     14930380237         3208     1897893773

35 rows selected.
The output pointed me to the SQL with SQL_ID=8q6xn4yqwpr2u. I would call this method an “educated guess”. The V$SQLAREA and V$OPEN_CURSOR views do not necessarily point to the SQL the session is executing at the moment; there may have been other sessions executing the same SQLs from the list. However, in this particular case, I was lucky. The output clearly singled out a SQL with just 1 execution that utilized a lot of CPU resources and read through 1897893773 * 8 / 1024 / 1024 / 1024 ≈ 14 TB of data! In addition, I executed the query several times and noticed that the SQL's CPU utilization and logical read statistics steadily increased.
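The LIO-to-volume conversion above is easy to verify; a one-liner sketch, assuming the standard 8 KB database block size:

```python
# Each buffer get visits one buffer; with an 8 KB block size the total
# volume of buffers read by SQL_ID 8q6xn4yqwpr2u is:
buffer_gets = 1897893773
block_size_kb = 8
tb = buffer_gets * block_size_kb / 1024 / 1024 / 1024  # KB -> MB -> GB -> TB
print(round(tb, 1))  # ~14.1 TB of logical I/O
```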
The mystery resolved
The rest was more or less straightforward. I found the execution plan for the SQL and the DB objects involved.
SQL> select * from table(dbms_xplan.DISPLAY_CURSOR('8q6xn4yqwpr2u'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  8q6xn4yqwpr2u, child number 0
-------------------------------------
select int.rowid , decode(int.SEGMENT1|| int.SEGMENT2|| int.SEGMENT3||
int.SEGMENT4|| int.SEGMENT5|| int.SEGMENT6|| int.SEGMENT7 , '',
decode (ccid_cc.code_combination_id, NULL, NULL,
replace(ccid_cc.SEGMENT4,'-',' ') || '-' || replace(ccid_cc.SEGMENT1,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT2,'-',' ') || '-' || replace(ccid_cc.SEGMENT3,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT5,'-',' ') || '-' || replace(ccid_cc.SEGMENT6,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT7,'-',' ') ),
replace(int.SEGMENT4,'-',' ') || '-' || replace(int.SEGMENT1,'-',' ') || '-' ||
replace(int.SEGMENT2,'-',' ') || '-' || replace(int.SEGMENT3,'-',' ') || '-' ||
replace(int.SEGMENT5,'-',' ') || '-' || replace(int.SEGMENT6,'-',' ') || '-' ||
replace(int.SEGMENT7,'-',' ') ) flexfield ,
nvl(flex_cc.code_combination_id, nvl(int.code_combination_id, -4)) ,
decode(int.SEGMENT1|| int.SEGMENT2|| int.SEGMENT3|| int.SEGMENT4||
int.SEGMENT5|| int.SEGMENT6|| int.SEGMENT7 , '',
decode(ccid_cc.code_combination_id, null

Plan hash value: 1344705414

----------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                         |       |       |     5 (100)|          |
|   1 |  SORT ORDER BY                 |                         |     1 |   171 |     5  (20)| 00:00:01 |
|   2 |   NESTED LOOPS OUTER           |                         |     1 |   171 |     4   (0)| 00:00:01 |
|   3 |    NESTED LOOPS OUTER          |                         |     1 |   112 |     2   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| GL_INTERFACE            |     1 |    53 |     1   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | GL_INTERFACE_N1         |     1 |       |     1   (0)| 00:00:01 |
|*  6 |     TABLE ACCESS BY INDEX ROWID| GL_CODE_COMBINATIONS    |     1 |    59 |     1   (0)| 00:00:01 |
|*  7 |      INDEX UNIQUE SCAN         | GL_CODE_COMBINATIONS_U1 |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    TABLE ACCESS BY INDEX ROWID | GL_CODE_COMBINATIONS    |     1 |    59 |     2   (0)| 00:00:01 |
|*  9 |     INDEX RANGE SCAN           | GL_CODE_COMBINATIONS_N7 |   417 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("INT"."STATUS"<>'PROCESSED')
   5 - access("INT"."USER_JE_SOURCE_NAME"='Invest' AND "INT"."SET_OF_BOOKS_ID"=1 AND "GROUP_ID"=2)
   6 - filter(("CCID_CC"."TEMPLATE_ID" IS NULL AND "CCID_CC"."CHART_OF_ACCOUNTS_ID"=101))
   7 - access("CCID_CC"."CODE_COMBINATION_ID"="INT"."CODE_COMBINATION_ID")
   8 - filter(("FLEX_CC"."TEMPLATE_ID" IS NULL AND "FLEX_CC"."CHART_OF_ACCOUNTS_ID"=101 AND
              "FLEX_CC"."SEGMENT1"="INT"."SEGMENT1" AND "FLEX_CC"."SEGMENT2"="INT"."SEGMENT2" AND
              "FLEX_CC"."SEGMENT3"="INT"."SEGMENT3" AND "FLEX_CC"."SEGMENT4"="INT"."SEGMENT4" AND
              "FLEX_CC"."SEGMENT5"="INT"."SEGMENT5" AND "FLEX_CC"."SEGMENT6"="INT"."SEGMENT6"))
   9 - access("FLEX_CC"."SEGMENT7"="INT"."SEGMENT7")

52 rows selected.

SQL>
The SQL was based on two tables: GL_INTERFACE (31 MB) and GL_CODE_COMBINATIONS (506 MB). Did you notice the size of the tables and the amount of LIO the SQL read through? It seems a bit funny how a SQL based on tables with a total size of 537 MB can read through 14 TB of data. Obviously, nested loops weren't the best way to join the tables in this case. The first thing I tried was to gather statistics on the GL_INTERFACE table.
SQL> begin
  2    sys.dbms_stats.create_stat_table(ownname=>'SYSTEM',stattab=>'cr884406stats',tblspace=>'users');
  3  end;
  4  /

PL/SQL procedure successfully completed.

SQL> begin
  2    DBMS_STATS.export_table_stats(ownname=>'GL',tabname=>'GL_INTERFACE',stattab=>'cr884406stats',
  3      statid=>'y01',cascade=>true,statown=>'SYSTEM');
  4  end;
  5  /

PL/SQL procedure successfully completed.

SQL> alter session set workarea_size_policy=manual;

Session altered.

SQL> alter session set sort_area_size=1000000000;

Session altered.

SQL> exec dbms_stats.gather_table_stats('GL',tabname=>'GL_INTERFACE',estimate_percent=>100,cascade=>TRUE);

PL/SQL procedure successfully completed.

SQL> begin
  2    DBMS_STATS.export_table_stats(ownname=>'GL',tabname=>'GL_INTERFACE',stattab=>'cr624406stats',
  3      statid=>'y02',cascade=>true,statown=>'SYSTEM');
  4  end;
  5  /

PL/SQL procedure successfully completed.
To be on the safe side, I backed up the table statistics to a separate table first. I know that starting with version 10g, the DBMS_STATS package keeps a backup of any statistics it overwrites. However, I still took this precautionary measure, just in case my activity made other SQLs' execution plans worse. After the stats were updated, the SQL's total execution time dropped to 1.4 seconds, and its execution plan changed to the following:
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 4195742536

---------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                         | 90953 |    21M| 18816   (2)| 00:03:46 |
|   1 |  SORT ORDER BY                |                         | 90953 |    21M| 18816   (2)| 00:03:46 |
|   2 |   NESTED LOOPS OUTER          |                         | 90953 |    21M| 18811   (2)| 00:03:46 |
|*  3 |    HASH JOIN OUTER            |                         | 90953 |    17M| 18810   (2)| 00:03:46 |
|*  4 |     TABLE ACCESS FULL         | GL_INTERFACE            | 90953 |    13M|  1062   (2)| 00:00:13 |
|*  5 |     TABLE ACCESS FULL         | GL_CODE_COMBINATIONS    |  2863K|   106M| 17730   (2)| 00:03:33 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| GL_CODE_COMBINATIONS    |     1 |    45 |     1   (0)| 00:00:01 |
|*  7 |     INDEX UNIQUE SCAN         | GL_CODE_COMBINATIONS_U1 |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("FLEX_CC"."SEGMENT1"(+)="INT"."SEGMENT1" AND "FLEX_CC"."SEGMENT2"(+)="INT"."SEGMENT2" AND
              "FLEX_CC"."SEGMENT3"(+)="INT"."SEGMENT3" AND "FLEX_CC"."SEGMENT4"(+)="INT"."SEGMENT4" AND
              "FLEX_CC"."SEGMENT5"(+)="INT"."SEGMENT5" AND "FLEX_CC"."SEGMENT6"(+)="INT"."SEGMENT6" AND
              "FLEX_CC"."SEGMENT7"(+)="INT"."SEGMENT7")
   4 - filter("INT"."SET_OF_BOOKS_ID"=1 AND "INT"."USER_JE_SOURCE_NAME"='Invest' AND "GROUP_ID"=2 AND
              "INT"."STATUS"<>'PROCESSED')
   5 - filter("FLEX_CC"."TEMPLATE_ID"(+) IS NULL AND "FLEX_CC"."CHART_OF_ACCOUNTS_ID"(+)=101)
   6 - filter("CCID_CC"."TEMPLATE_ID"(+) IS NULL AND "CCID_CC"."CHART_OF_ACCOUNTS_ID"(+)=101)
   7 - access("CCID_CC"."CODE_COMBINATION_ID"(+)="INT"."CODE_COMBINATION_ID")

27 rows selected.

SQL>
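As a rough, hypothetical back-of-envelope (the per-row figure below is my own derivation, not something I measured at the time): the stale statistics had the optimizer expecting a single row from GL_INTERFACE, so nested loops looked cheap; with the real row count of ~90,953 revealed by fresh stats, every outer row drove its own index probes and table accesses:

```python
# Why the nested-loops plan hurt: logical reads scale with the outer row count.
actual_outer_rows = 90953        # GL_INTERFACE rows per the corrected plan
total_buffer_gets = 1897893773   # observed LIO for the runaway statement

gets_per_outer_row = total_buffer_gets / actual_outer_rows
print(round(gets_per_outer_row))  # ~20867 buffer gets per GL_INTERFACE row
```

With the hash join, each table is instead scanned once, which is why the cost collapsed from terabytes of LIO to a 1.4-second execution.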
We killed the GL process and re-executed it. The process finished in less than 20 minutes, and I never heard back from the client. :)
Summary
I hope that you learned something from my Friday the 13th story. The following are several points I would like to highlight.
- It doesn’t matter how knowledgeable you think you are. Sometimes you just need to use the “Phone A Friend” option to get some more ideas quickly. Gleb Otochkin (Twitter, Blog) made my day.
- It’s always handy to have good references for performance analysis basics. The “Beginning Performance Tuning” article from Arup Nanda (Twitter, Blog) is one to have in your bookmarks. :)
- You may need to be creative and use methods that may not seem right for the occasion at first glance (e.g. joining the V$SQLAREA and V$OPEN_CURSOR views). All that matters during an emergency is the result, and if a method helps to address the problem, then who cares? We should use it. Right? :)
- I followed up on the problem and discussed it with another smart Oracle DBA from Pythian – Maris Elsins (Twitter, Blog). He pointed out that “oradebug dump errorstack 1” is another method to keep in an Oracle DBA's toolbox. It dumps the session's state, including the current SQL.
- Friday the 13th stories don’t always end horribly. :)
10 Comments
Cool article, learned a lot from it. Thank you :)
Thanks Yury for sharing this.
Regards
GregG
Great story Yury, thanks for sharing it.
Wow! Folks,
Thank you for your support. It keeps me going.
Honestly, I had big doubts about publishing it. I was wondering if it would add any value in the chosen format. It looks like it does.
Thanks Paul, GregG and Joao Paulo for your feedback,
Yury
Nice post.
Did you try querying V$SESSION_LONGOPS view?
I don’t think I did Asif. This is another option to keep in mind. Thank you for the hint. :)
Nice post, i have faced issues like this related to stats, this post will help me investigate further down to tables instead of regenerating stats for the schema :)
Nice job Yury.
I must say that it is quite unreasonable for a client to request an update every 2-5 minutes.
If they want a problem resolved, they need to allow for greater periods of concentration.
Snapper by Tanel Poder is extremely handy in such pressure spots. It has saved my day several times. Thanks Yury for this story, which truly is a true blog post.
please let me know – how you calculate
SQL reduced its total execution time to 1.4 seconds and changed its execution plan
thanks,
aga