I’m rather passionate about what we do here at PalominoDB. When I tell people about what makes us different from the competition, I often discuss the proactive work we do. We are not a reactive company. We know that proactive reviews are what keep a database up and running smoothly, and we definitely want to prevent those late-night pages that everybody loves. So what do we do to make this happen?
We start with daily health checks. These are aided by scripts, but include the primary DBA reviewing the last 4 days of core Cacti graphs (or those of a similar trending tool, as we require clients to have one) for anomalous behavior. By core, we mean the key metrics that point out workload shifts: CPU, IOWait, Load Average, Swapping, SQL Query Type Counters and InnoDB Row Activity. We verify that backup logs are error free, and we make sure nothing unusual has shown up in the MySQL error logs. Finally, we review the Nagios alerts of the last day, making sure that criticals are followed up on, that nothing has been acknowledged and forgotten, and that alerts are enabled.
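To give a feel for the kind of script that can help with the log portion of a daily check, here is a minimal sketch in Python. It is not our actual tooling. It assumes log lines carry a bracketed severity tag such as `[ERROR]` or `[Warning]`, as MySQL error logs do, and the function name `flag_log_lines` and the sample lines are made up for illustration.

```python
import re

# Assumption: severity appears in square brackets, as in MySQL error logs.
SEVERITY = re.compile(r"\[(ERROR|Warning)\]", re.IGNORECASE)


def flag_log_lines(lines):
    """Return (line_number, text) pairs for lines tagged as an error or warning."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if SEVERITY.search(line):
            flagged.append((number, line.rstrip("\n")))
    return flagged


# Hypothetical sample lines, invented for this illustration only.
sample = [
    "110301 10:00:01 [Note] Event Scheduler: Loaded 0 events",
    "110301 10:00:02 [ERROR] Can't open the mysql.plugin table.",
    "110301 10:00:03 [Warning] 'user' entry ignored",
]

for number, text in flag_log_lines(sample):
    print(number, text)
```

A script like this only narrows down what to read. The DBA still decides whether a flagged line is actually unusual for that system.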
Once a week, the primary DBA reviews all Cacti graphs, not just the core ones. They also review the daily health checks, ensure that every item that has come up is being acted on, and verify that tickets are not stalling.
Once a month, we do SQL Reviews of systems, unless we have been requested to do them more frequently based on release schedules. Using mk-query-digest, general logs with microsecond patches or tcpdumps, we put together a list of the most frequently executed and the slowest queries, and provide our clients with a list of recommendations: index, datatype and data model changes, caching suggestions and query rewrites. Once provided, we follow up regularly to ensure that crucial changes are being implemented in a timely manner.
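The idea of ranking queries two ways, by how often they run and by how slow they are, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how mk-query-digest works internally: it assumes you already have a list of (normalized query, execution time in seconds) pairs, and the function name `summarize_queries` and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize_queries(samples, top_n=3):
    """Rank query fingerprints by execution count and by average time.

    samples: iterable of (fingerprint, seconds) pairs (assumed input format).
    Returns (most_executed, slowest), each a list of fingerprints.
    """
    counts = defaultdict(int)
    total_time = defaultdict(float)
    for fingerprint, seconds in samples:
        counts[fingerprint] += 1
        total_time[fingerprint] += seconds

    # Ties are broken alphabetically so the ordering is deterministic.
    most_executed = sorted(counts, key=lambda q: (-counts[q], q))[:top_n]
    slowest = sorted(
        counts, key=lambda q: (-total_time[q] / counts[q], q)
    )[:top_n]
    return most_executed, slowest


# Hypothetical sample data, invented for this illustration only.
samples = [
    ("SELECT * FROM users WHERE id = ?", 0.002),
    ("SELECT * FROM users WHERE id = ?", 0.003),
    ("SELECT * FROM users WHERE id = ?", 0.002),
    ("SELECT * FROM orders WHERE status = ?", 1.5),
    ("UPDATE sessions SET seen = ? WHERE id = ?", 0.01),
    ("UPDATE sessions SET seen = ? WHERE id = ?", 0.02),
]

most_executed, slowest = summarize_queries(samples, top_n=2)
print("Most executed:", most_executed)
print("Slowest on average:", slowest)
```

The two lists often differ, which is the point of looking at both: a cheap query run constantly can matter as much as a rare query that takes seconds.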
Then we have activities that we do quarterly. Some clients prefer these be done more frequently and that is fine by us! These include:
- Recovery tests if regular refreshes of other environments don’t test this.
- Ensuring tools are at up-to-date versions.
- Ensuring runbooks and documentation are up to date.
- Capacity reviews of existing workloads and hardware.
- Security audits to ensure that changes since the last audit do not violate security policy.
- Monitoring and trending audits to ensure that appropriate checks are in place and that all needed graphs are graphing.
This is where we start in our proactive service. Where it ends depends only on our clients’ needs. Any opportunity to solve an issue before it is noticed is one that we relish and that our clients appreciate. Of course, the real test of our mettle is the number of pages waking us up in the middle of the night. If we have to wear our asbestos suits to work, then we are not doing our job!