One of our clients hit an ORA-1555 “snapshot too old” error two nights in a row. The quick and dirty fix would be to increase UNDO_RETENTION and, potentially, the size of the UNDO tablespace.
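For the record, that dirty fix would look roughly like the following sketch; the datafile path and the sizes are made up purely for illustration:

-- Dirty fix, illustration only: raise undo retention to a day (in seconds)
-- and let the undo tablespace grow. Path and sizes are hypothetical.
ALTER SYSTEM SET undo_retention = 86400;

ALTER DATABASE DATAFILE '/u01/oradata/PROD/undotbs01.dbf'
  AUTOEXTEND ON NEXT 512M MAXSIZE 32G;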
I was looking at it together with Dave, my new teammate, and a small detail popped up right away: Query Duration=76584. That is 21+ hours. I checked the retention period, and it was 2 hours, so the dirty fix would probably fail unless it was very dirty indeed: bumping undo retention to something like a day and bloating the UNDO tablespace, still with no guarantee that the query would finish within a day. (In fact, I rather think it would run for 42 hours. But I digress.)
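The retention check itself is just a parameter lookup; a minimal sketch:

SELECT TO_NUMBER(value) / 3600 AS undo_retention_hours
  FROM v$parameter
 WHERE name = 'undo_retention';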
The query was simple (names obscured):
SELECT * FROM TABLE1 WHERE (COL_DATE >= (SELECT MAX(COL_DATE) FROM TABLE2 WHERE FILTER_COL = :"SYS_B_0") );
Table1 has millions of rows, so my first reaction was to check whether this column has an index or histograms on it, and when statistics were last collected. There was no index and no histograms, so the execution plan couldn't have been anything other than a full table scan. Even so, a plain full table scan shouldn't keep a query running for 21+ hours.
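That check is just a couple of data dictionary queries; a sketch, with the owner name obscured the same way as the table and column names:

-- Any index on the filter column? (owner obscured)
SELECT index_name, column_position
  FROM dba_ind_columns
 WHERE table_owner = 'OWNER1'
   AND table_name  = 'TABLE1'
   AND column_name = 'COL_DATE';

-- Histogram buckets and last stats collection for the column
SELECT column_name, num_distinct, num_buckets, last_analyzed
  FROM dba_tab_col_statistics
 WHERE owner       = 'OWNER1'
   AND table_name  = 'TABLE1'
   AND column_name = 'COL_DATE';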
We couldn't easily reach the client's developers to ask which batch this statement belonged to, but it was definitely a batch job, and it seemed to be a regular one. So, assuming the duration was correct, the query was supposed to be running right at that moment. I checked V$SESSION for sessions with STATUS='ACTIVE' OR LAST_CALL_ET > 3600, and got nothing back. Dave recalled that there was a bug related to the query duration reported in the ORA-1555 error message and in V$UNDOSTAT. Fair enough; we located the Metalink note and used another query with a workaround, which gave the same result.
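The V$SESSION check was essentially the following (a minimal sketch):

SELECT sid, serial#, status, last_call_et, sql_hash_value, program
  FROM v$session
 WHERE status = 'ACTIVE'
    OR last_call_et > 3600;   -- last call more than an hour ago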
The next step was to look at V$SQL and see whether the query was still there and running. And, bingo, there it was! V$SQL provided the hash value, and V$SESSION.SQL_HASH_VALUE led us to the guilty session, with status 'INACTIVE' and LAST_CALL_ET in the range of a few minutes. A quick peek into V$SESSION_WAIT showed that the session had been waiting on SQL*Net message from client for several minutes. I was already 99% sure what was going on and just needed a few minutes to prove it with a 10046 trace, so we enabled the 10046 event at level 8 (a sketch of these lookups follows the trace below). The content of the trace file speaks for itself:
*** 2007-05-15 09:33:12.418
WAIT #1: nam='SQL*Net message from client' ela= 256877226 p1=1413697536 p2=1 p3=0
WAIT #1: nam='db file scattered read' ela= 8144 p1=12 p2=39721 p3=12
WAIT #1: nam='SQL*Net message to client' ela= 4 p1=1413697536 p2=1 p3=0
WAIT #1: nam='SQL*Net more data to client' ela= 175 p1=1413697536 p2=2001 p3=0
FETCH #1:c=0,e=10805,p=12,cr=4,cu=0,mis=0,r=100,dep=0,og=4,tim=6190387721660
*** 2007-05-15 09:37:34.942
WAIT #1: nam='SQL*Net message from client' ela= 256356355 p1=1413697536 p2=1 p3=0
WAIT #1: nam='SQL*Net message to client' ela= 3 p1=1413697536 p2=1 p3=0
WAIT #1: nam='SQL*Net more data to client' ela= 156 p1=1413697536 p2=2001 p3=0
FETCH #1:c=0,e=1462,p=0,cr=2,cu=0,mis=0,r=100,dep=0,og=4,tim=6190644080126
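For reference, the lookups described above boil down to something like the sketch below. The SQL text fragment, SID 123 and serial# 456 are hypothetical placeholders; DBMS_SYSTEM.SET_EV is the usual (unsupported) way to set an event in another session, and on 10g and later DBMS_MONITOR.SESSION_TRACE_ENABLE is the documented alternative.

-- Find the statement in the shared pool and grab its hash value
SELECT sql_text, hash_value, executions
  FROM v$sql
 WHERE sql_text LIKE 'SELECT * FROM TABLE1 WHERE (COL_DATE >=%';

-- Find the session currently associated with that statement
SELECT sid, serial#, status, last_call_et, program
  FROM v$session
 WHERE sql_hash_value = &hash_value;   -- value found above

-- What is that session waiting on right now?
SELECT event, wait_time, seconds_in_wait, state
  FROM v$session_wait
 WHERE sid = &sid;

-- Enable a 10046 level-8 trace on the session (SID/serial# hypothetical)
EXEC sys.dbms_system.set_ev(123, 456, 10046, 8, '');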
As the trace shows, the query was not actively kicking, which is why we didn't catch it at the beginning. I should also mention that the program (as indicated in V$SESSION) was DTSRun.exe, so it seems the culprit was some kind of data-extraction job that fetches 100 rows and then processes them for about 256 seconds before going for another fetch. Well, what would you expect from a Microsoft tool? ;-)
Anyway, since DTS is not our responsibility, we had to hand this issue to the relevant group with the definitive diagnosis, “It’s not a database problem.”