Sleuthing for Temporary Space Spendthrifts

Pythian Marketing

December 20, 2012

I was recently looking at an issue where a large database server was running out of temp space. The issue had occurred in the past, so the goal was to identify what had been causing the system to run out of temp. Although ORA-1555 messages will appear in the database alert log (with query text) when a query runs out of temporary space, they only talk about the query that was “unlucky” enough to be attempting to get space when the tablespace was full, not who was using the temp space.

A quick web search shows that most people seem to be querying current tables. Coskan Gundogar blogged about a historical monitoring script he wrote but only drilled down into a specific session and serial number.

A quick look at AWR showed that v$tempseg_usage had no direct historical view. There is dba_hist_tbspc_usage info for the temp tablespace, but it only shows aggregate usage.

However, dba_hist_active_sess_history, the AWR samples of active session history data, has a new column in 11gR2: TEMP_SPACE_ALLOCATED. From the docs:

Amount of TEMP memory (in bytes) consumed by this session at the time this sample was taken.

Having temp usage in a table like ASH is actually much more useful than a v$tempseg_usage capture since it lets us drill down not only to the query, but also to the exact SQL plan and plan step where the temp space is being used.

Querying out the data requires a bit of elbow grease though: There is no delta column like for the I/O and time model stats, so we need to use some fancy analytic functions to compute the delta values ourselves.

dba_hist_active_sess_history itself is a view composed of multiple joins on the back end, so it takes quite a while to query. I generally create a fixed version of the table with the timestamp I want to analyze, in order to speed up the queries later. And since this is an Exadata system, I’ll take advantage of hybrid columnar compression. Adjust the begin_interval_time range to match the date range you’re looking at.

CREATE TABLE marc_ash COMPRESS FOR QUERY LOW as
SELECT *
FROM   dba_hist_active_sess_history
WHERE  dbid = (SELECT dbid
               FROM   v$database)
       AND snap_id IN (SELECT snap_id
                       FROM   dba_hist_snapshot
                       WHERE  dbid = (SELECT dbid
                                      FROM   v$database)
                              AND begin_interval_time BETWEEN To_date(
                                  '20-DEC-2012 05:00',
                                  'dd-mon-yyyy hh24:mi')
                                  AND
                                  To_date(
                                  '20-DEC-2012 10:00', 'dd-mon-yyyy hh24:mi'));

(Side note: I’m trying out an online SQL beautifier tool. Readers, do you find this format easier to read?)

Once the base table is created, I can query it:
(Thanks to Randolf Geist for pointing out a missing grouping for parallel execution)

SELECT *
FROM   (SELECT Rank()
                 over (
                   ORDER BY SUM(Nvl(temp_space_delta, 0)) DESC) position,
               sql_id,
               sql_plan_hash_value,
               sql_plan_operation,
               sql_plan_line_id,
               Count(DISTINCT sql_exec_id)                      total_execs,
               Trunc(SUM(Nvl(temp_space_delta, 0)) / 1024 / 1024)
               ||'m'                                            temp_usage
        FROM   (SELECT sql_exec_id,
                       sql_id,
                       sql_plan_hash_value,
                       sql_plan_operation,
                       sql_plan_line_id,
                       temp_space_allocated - Nvl(Lag(temp_space_allocated, 1)
                       over (
                         PARTITION BY sql_exec_id, sql_id, session_id
                         ORDER BY sample_id), 0)
                       temp_space_delta
                FROM   marc_ash)
        GROUP  BY sql_id,
                  sql_plan_operation,
                  sql_plan_hash_value,
                  sql_plan_line_id)
WHERE  position <= 10
ORDER  BY position;

A bit of explanation is in order: The core of the query selects from marc_ash using the lag() analytic function to get the previous temp space allocation for the same execution, SQL statement, and parallel query process, and subtracts it to get the difference in temp space usage for the current sample. I then take this result, plus some information about the SQL step executing, and group it by SQL ID, plan operation, and line ID. And finally, I sort the results, keeping the top 10, and apply some formatting to the result.

And as is usual with ASH, since I’m only looking at 10-second samples of data, the results don’t reflect 100% what really happened. Though the longer the execution step runs, the more accurate the end results will be.

Here’s a sample result (with a bit of cleanup of the column headers):

POSITION SQL_ID         SQL_PLAN_HASH   SQL_PLAN_OP     SQL_PLAN_LINE_ID TOTAL_EXECS    TEMP_USAGE
------- --------------- --------------- --------------- --------------- --------------- ----------
1       a9hrk7g2ahbtk   1062456949      HASH JOIN       15              1               142607m
2       b40ahcxu8nxdq   2741224596      HASH JOIN       21              1               139959m
3       b62gn3yqyau4k   3673174370      SORT            1               4               92378m
4       5rf3btrz44d6v   3879859951      SORT            3               1               35128m
5       5rf3btrz44d6v   3879859951      SORT            6               1               34095m
6       8ucjdxhxwz93g   2599798119      HASH            15              1               10873m
7       901mxasy6gu2j   801798936       SORT            2               28              6268m
8       4b7jatzjaf4z3   270793325       SORT            3               3               5876m
9       0h5cvsbs92cyp   1935762178      SORT            1               3               4522m
10      84z212am4j987   3445415588      SORT            1               2               4182m

During this time period, there was high temp usage from the two plan steps of SQL ID 5rf3btrz44d6v. Even though it only ran once, it consumed close to 70G of temp space during this period. The benefit of using ASH data, though, is that I not only know which SQL statement is responsible, but I also have an idea of which plan steps are responsible: two sort operations.

We can plug in the SQL ID and plan hash value in the output to dbms_xplan.display_awr to see the plan itself. The results show the full query text as well as the execution pan involved.

A short snippet of the plan:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                         | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                  |               |       |       |       |   100K(100)|          |       |       |        |      |            |
|   1 |  PX COORDINATOR                                   |               |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (ORDER)                              | :TQ10009      |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,09 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY                                  |               |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,09 | PCWP |            |
|   4 |     PX RECEIVE                                    |               |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,09 | PCWP |            |
|   5 |      PX SEND RANGE                                | :TQ10008      |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,08 | P->P | RANGE      |
|   6 |       SORT GROUP BY                               |               |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,08 | PCWP |            |
|   7 |        PX RECEIVE                                 |               |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,08 | PCWP |            |
|   8 |         PX SEND HASH                              | :TQ10007      |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,07 | P->P | HASH       |
|   9 |          HASH GROUP BY                            |               |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,07 | PCWP |            |
|  10 |           VIEW                                    | VW_DAG_0      |    76M|    18G|       | 67638   (1)| 00:13:32 |       |       |  Q1,07 | PCWP |            |
|  11 |            HASH GROUP BY                          |               |    76M|    16G|    17G| 67638   (1)| 00:13:32 |       |       |  Q1,07 | PCWP |            |
|  12 |             HASH JOIN                             |               |    76M|    16G|       | 37552   (2)| 00:07:31 |       |       |  Q1,07 | PCWP |            |
|  13 |              BUFFER SORT                          |               |       |       |       |            |          |       |       |  Q1,07 | PCWC |            |
|  14 |               PX RECEIVE                          |               |  2709K|    67M|       |  5287   (1)| 00:01:04 |       |       |  Q1,07 | PCWP |            |
|  15 |                PX SEND HASH                       | :TQ10004      |  2709K|    67M|       |  5287   (1)| 00:01:04 |       |       |        | S->P | HASH       |
|  16 |                 TABLE ACCESS STORAGE FULL         | TAB1          |  2709K|    67M|       |  5287   (1)| 00:01:04 |       |       |        |      |            |

A very quick look at this execution plan seems to indicate that it’s returning an absolutely massive dataset to the database client and sorting it twice. It would benefit from more conditions to restrict results to what the client is actually looking for.

Any comments, readers? How are you monitoring for temp space usage? Anyone using resource management to control temp usage?

Insight and analysis of technology and business strategy

Sleuthing for Temporary Space Spendthrifts

Top Categories

Tell us how we can help!