Part of the appeal of AWS’ Redshift is that it’s a managed service, which means lower administration costs. While you don’t have to hire a full time DBA to make sure it runs smoothly (from Pythian’s experience it takes ~10-20 hours/month to manage Redshift), there are still some tasks that should be attended to keep it happy:
- Skew analysis
- Compression analysis
- Query monitoring
Let us start with Vacuuming as the first topic of a series of deeper dives into this list.
Vacuuming is an integral part of performance maintenance of Redshift. Since deletes and updates both flag the old data, but don’t actually remove it, if we’re doing those kinds of actions, vacuuming is needed to reclaim that space. Updates and deletes can be pretty big performance hits (a simple update can easily take 60 secs on a 50 million record table on a small cluster, so we’re looking at 20 minutes for a similar update on a 1 billion record table), so we try to avoid them as much as we can on large tables. The space reclamation portion of the vacuum typically accounts for 10% of the time we see spent on the tables. We can use the SORT ONLY parameter to skip this phase, but we generally have no compelling reason to.
In addition, if tables have sort keys, and table loads have not been optimized to sort as they insert, then the vacuums are needed to resort the data which can be crucial for performance. While loads of empty tables automatically sort the data, subsequent loads are not. We have seen query times drop by 80% from the implementation of vacuuming, but of course the impact varies with table usage patterns.
The biggest problem we face with vacuuming is the time it takes. While vacuuming does not block reads or writes, it can slow them considerably as well as take significant resources from the cluster, and you can only vacuum one table at a time. Remember that resource utilization can be constrained through WLM queues. A typical pattern we see among clients is that a nightly ETL load will occur, then we will run vacuum and analyze processes, and finally open the cluster for daily reporting. The faster the vacuum process can finish, the sooner the reports can start flowing, so we generally allocate as many resources as we can.
Let us start with the process itself. It’s simple enough and you can get syntax documentation from AWS . There’s not too much that’s tricky with the syntax and for most use cases
will suffice. Note that INTERLEAVED sort keys need the REINDEX parameter added for all re-indexing to occur. You can discern which tables have this set up by using the query:
select schemaname, tablename from (SELECT n.nspname AS schemaname ,c.relname AS tablename ,min(attsortkeyord) min_sort FROM pg_namespace AS n INNER JOIN pg_class AS c ON n.oid = c.relnamespace INNER JOIN pg_attribute AS a ON c.oid = a.attrelid WHERE c.relkind = 'r' AND abs(a.attsortkeyord) > 0 AND a.attnum > 0 group by 1,2 ) where min_sort<0
In order to give the vacuum process more resources, we preface this command with
SET wlm_query_slot_count TO <N>;
where N is the maximum number of query slots we think we can get away with. If you’re not sure what that number should be (we’ll discuss WLM queues in another post), usually 5 is a safe number though be warned that if the value of wlm_query_slot_count is larger than the number of available slots for the service class, the vacuum command will fail.
Knowing when to vacuum is reasonably straight forward. Anytime after substantial inserts, updates, or deletes are made is always appropriate, but you can be more exacting by querying two tables:
select * from STL_ALERT_EVENT_LOG where Solution LIKE ‘%VACUUM command%’
select * from SVV_TABLE_INFO where unsorted > 8
The latter check works great for daily loads. We will often set the threshold at 8 (percent) immediately after the loads, then run another vacuum process in the evening with a lower threshold (4 percent) that addresses larger tables that take a fair amount of time to vacuum since we want to avoid that situation in the morning.
Last fall AWS built a nice tool to automate vacuums, Analyze & Vacuum Schema Utility, that incorporated these queries. It works quite well, and we recommend it to our clients as a simple way to set up this maintenance. However, note that it does not automatically add the REINDEX parameter for those tables with INTERLEAVED sortkeys. The code is all available, so it is easy enough to adjust to make more custom filtering of tables (on fact_* and dim_* for instance) within a schema.
AWS has built a very useful view, v_get_vacuum_details, (and a number of others that you should explore if you haven’t already) in their Redshift Utilities repository that you can use to gain some insight into how long the process took and what it did. None of the system tables for vacuuming keep any historical information which would be nice for tracking growing process times, but you can see them for a week in STL_QUERY which gets purged to a history of 7 days. I recommend creating a simple process to track the vacuum data:
First create the table:
create table vacuum_history sortkey (xid) as select * from v_get_vacuum_details where processing_seconds > 0;
Then set up a cron process to populate:
0 18 * * * psql -h myRScluster -U myUser -p5439 -c “INSERT INTO vacuum_history SELECT * FROM v_get_vacuum_details WHERE xid > (SELECT MAX(xid) FROM vacuum_history) where processing_seconds > 0;” &> /var/log/vacuum_history.log
Once you start to see tables taking an inordinate amount of time to vacuum, some additional intervention may be appropriate. Our team recently ran into a sizable table (3 billion records) that had been taking 3 hours to vacuum daily. Some issue occurred where the table needed a partial reload of 2 billion rows. Once that finished, we ran a vacuum which kept going all afternoon. Checking SVV_VACUUM_PROGRESS we could see that it would take almost 30 hours to complete. Note that restarting a stopped vacuum does not mean the process will pick up where it left off. Since this would have impacted the daily load performance, we killed the vacuum with “cancel <pid>” using the pid pulled from
select pid, text from SVV_QUERY_INFLIGHT where text like ‘%Vacuum%’
We then ran a deep copy (created a new version of the table and ran a SELECT INTO) which took about 5 hours. The load into an empty table triggers the correct sorting, so a subsequent vacuum took only a few minutes to complete.
Just a note on killing long running vacuums: it sometimes doesn’t work especially once it’s in the initialize merge phase. We’ve found that continually issuing the cancel command while it’s in the sort phase is effective, but the point it to be wary of vacuuming large tables for their first time. Vacuums on large, unsorted tables write temporary data to disk, so there is also the potential to run out of disk and freeze the cluster, so be sure to always check that up to 3x the table size of disk space is available.
There are a few simple strategies to prevent long running vacuums:
- Load your data in SORTKEY order: The incoming data doesn’t have to be pre-ordered, just greater than existing data.
- Vacuum often: A table with a small unsorted region vacuums faster than one with a large unsorted region.
- If tables become too large to vacuum within a maintenance window, consider breaking them apart: We often see multi-billion record tables where the only data being queried is from the last month or two.
- Deep copies can be a faster solution than vacuums.
We are getting the below error while trying to run vacuum in redshift.
An error occurred when executing the SQL command:
[Amazon](500310) Invalid operation: VACUUM is running;
1 statement failed.
Execution time: 1.65s
But we have not run vacuum for more than one week and cannot see any recent history of vacuum running currently.How to fix this error?
Hey, I am experiencing a similar error. Think it is do to the issue mentioned here: https://forums.aws.amazon.com/thread.jspa?threadID=135790. Apparently you can only run one VACUUM at a time in the same cluster. To check if this is true, I tried vacuuming two different tables at the same time and I got the same error. So for me I think the error might be caused, because someone else was vacuuming at the same time in the same cluster.