How to Overcome the Top 10 SQL Server Performance Challenges

Tudor Robins

December 1, 2020

Tags: Oracle, Technical Track

Working with consulting clients is rewarding, interesting, and challenging. In discussion with some of my colleagues, I’ve noticed the same performance challenges come up over and over again across clients with completely different hardware, industry verticals, levels of staff seniority, and more. I believe a few factors may be driving these issues including:

How easy Microsoft makes working with SQL Server;
The still-recent explosion of SQL Server as a fully mature enterprise-level product;
Misguided advice on the internet;
Or, a combination of the above.

The good news is that it's easy to avoid many of the worst performance killers. With the proper knowledge, most DBAs (database administrators) can act proactively and eliminate potential slowness by looking out for the most common issues that have plagued other people working with the product. In the interest of giving our readers succinct, practical advice I’ve collected the most common issues into the following top 10 list. Let’s get on with it, in no particular order:

1. You run an application that is mostly read activity (e.g. most OLTP) with insufficient indexes

I had a case where a client had a large table with several hundred million records and every time a specific query ran, it would paralyze the entire system. Because this particular query had to run several times per day, this was a constant headache. We got the execution plan for this query and immediately saw it was being scanned because there wasn’t a proper index for it. When we suggested adding an index, the client was very hesitant because the table was so large. They worried about the performance degradation for the “Insert,” “Delete,” “Update” activity and the increase in disk space. The truth is, if your system is already paralyzed by this situation, it’s very unlikely that adding a few milliseconds to “Insert,” “Delete,” “Update” operations is worse than letting a query run without a proper index against a large table. We showed them how their system was read-heavy and convinced them to add the missing index. Problem solved.

2. You’re still running a lot of reporting directly on the operational copy of your database

You may think running OLTP queries and reporting queries on one system will save money and / or hardware, but you’ll end up with neither the OLTP nor the reporting working properly. By definition, reporting queries will usually touch many records, and by doing them in the same machine you end up backing yourself into these corners:

You lock records; interrupting and slowing your OLTP-style queries.
You end up putting NOLOCK hints everywhere and compromising the quality of your reporting.
You activate snapshot isolation, and your IO system isn't ready for the extra tempdb load.

Let’s look at your options: availability groups, log shipping or good old replication. The optimal solution depends on your budget and your reporting goals, so make sure you look into what each one can do and what you might need to compromise for each.

3. Your VM hosts are way overcommitted

If your VM cluster has become the SQL Server bucket where all workloads go, don’t be surprised when it doesn’t perform well. I had someone tell me their “SQL Server doesn’t virtualize well,” – of course it doesn’t; neither does any resource-intensive application when it’s running on overcommitted virtual resources. Here are the main guidelines I use:

I don’t like overcommitting memory on the VM host. If the host has 128GB of RAM, I want all the VM’s memory to sum up to less than that amount, not more.
Although CPU is a more flexible resource, I still don’t recommend overcommitting it by 20 to 25 percent.
Many places have a DBA, a VM admin and a storage admin. The VM admin checks memory and CPU pools, but has a harder time figuring out if the IO limits are hit. So, make sure your storage admin is in the loop regarding your virtualization goals as well.

4. Your IO sucks

This one is pretty self-explanatory. I’ve seen environments where adding an all-flash array was like pressing the TURBO button. Suddenly the issues were about contention on spinlocks to squeeze the most out of the CPUs, instead of the same old PAGEIOLATCH from the slower storage. Create a performance monitor collection and get your average seconds per read and average seconds per write baselined. If your latency on read or write is consistently over 15 to 20 ms then your IO is slow. No ifs, no buts, even if the storage admin can’t see anything on the SAN. This will be your biggest bottleneck until you take care of it.

5. You have massive queries that are hard to optimize

Writing query optimizers is a heuristic, non-trivial effort that is simply HARD WORK. They also have to perform well most of the time to be taken seriously. As good as the SQL Server optimizer is, you, as a DBA or developer, can still help it out a lot of times. Here are some poor practices I still see too often:

Endless nested views (a view calls a view that calls a view that calls a view ...).
A lot of business logic and manipulation coded as single massive SQL statements. It’s easier to the optimizer if we feed it questions in more manageable chunks; don’t be afraid to break down a query and throw some intermediate results on a temp table if necessary.
Queries that want to do everything and end up underperforming for all cases. Again, it’s OK to throw this inside a stored procedure and write different SQL statements for different branches of the procedure. With statement level compilation this is not a bad thing anymore.

6. Tempdb is not properly configured

This one should be part of any DBA’s installation checklist.

Add more tempdb files than the default one. I like to use a post-install rule of 0.25* logical processors up to a max of eight. Adjust up as needed if you see contention once you have an actual workload running.
Pre-size them all, and set them all to the same size and autogrowth.
If you’re running a version older than 2016 use these trace flags 1117 / 1118 for best performance. After 2016, the behaviors enabled by these flags are controlled through database (MIXED_PAGE_ALLOCATION = OFF) and filegroup settings (AUTOGROW_ALL_FILES = ON).

7. Instant file initialization is not set and your autogrowth settings are still at the default

This is something else that should be part of the installation checklist (or part of the installer, period). Instant file initialization allows SQL Server to grow the data files without having to zero them out, making file growth an *instant* metadata operation on Windows. To enable instant file initialization you need to grant the SQL Server service account the “Perform volume maintenance tasks” privilege. This option is now built into the installer after SQL 2016 so there really is no reason why it should not be turned on for 99 percent of cases. Related to this recommendation, you should also:

Pre-grow your files to the best of your knowledge and monitor them so you can proactively grow them if necessary.
Ensure instant file initialization is enabled so if autogrowth does end up triggering, it doesn’t suspend your session.
Replace the awful 1MB to 10 percent defaults with a uniform increment that makes sense for your environment. For smaller databases I often use 100MB increments, for larger ones up to 4GB increments.

8. Your application or your queries are pulling way more data than necessary

I was reviewing an execution plan with a colleague a few days ago. The client was setting up an SSIS process to consume some data, transform and move it to another system, however the package was simply never completed. Going over the steps of the plan we found a join where the estimated data output was more than 1TB (yes, TERABYTE) of data through the execution plan. Going over the code we found a cartesian product in the logic, brought it up, refactored it with the developers and got much better performance once the data flowing inside the execution plan was orders of magnitude less. We’ve faced this issue before because of the way some applications are built. For example, bringing thousands of records over to the application when it’s well known that users almost never browse past the first five pages of any search results. Or, not implementing caching at the server level or on a dedicated caching server (like REDIS for example) to avoid constantly moving catalog-style data back and forth on the database. Or, queries that do the top N by some type of category or criteria but they still go through thousands or even millions of records to pick the top N because of the lack of proper filtering criteria.

9. You’re using distributed linked server queries without looking into how much data is getting pumped through the network

Linked servers are a cool and useful feature and certainly have their place inside the SQL Server toolbox. However, some applications take the linked server concept and insert it into operational or critical database code without taking into account the complexity or cost of distributed queries. Just because something works, doesn’t mean it’s the right thing to do. Again, this is one of those features that’s so easy and transparent to use that we forget about the implications. For example, I’ve found code where T-SQL code from server B is doing a join from two tables on remote server A, and one small table on B. This all works, but under the hood the optimizer for linked server queries is just not that good. In this case, it was pulling the data from both remote tables locally to do the join to the small local table, resulting in a lot of waste of CPU and network bandwidth. Refactoring this process, even though it was *working*, was worth it because of the big performance improvement of moving the small table to server A and doing all the computation there.

10. Your maintenance routines are not optimal

I’m a big fan (along with many other DBAs) of Ola Hallengren’s great maintenance solution . Using it will cover basics such as different fragmentation levels, updating stats that have been modified and more. I’ve seen environments where no index maintenance is done, and others where too much is done instead. The same applies to statistics; a very large percentage of the cases where an execution plan suddenly performs worse is down to statistics not being up to date. The out-of-the-box maintenance plans that ship with SQL Server definitely don’t help in this regard. Do yourself a favor and don’t try to reinvent the wheel; instead take advantage of Ola’s amazing community contribution. So there you have it; please check your servers today and see if any of the above items apply to you. If so, I hope this information will improve your performance and save you from future headaches! Did we miss something you consider a massive performance killer? If so, we’d love to hear from you in the comments.

SQL Server Health Check

While we hope these tips are helpful, if you’d like additional assistance, please ask about Pythian’s SQL Server Health Checks. Our experienced data platform consultants will support you in optimizing the performance and availability of your SQL server infrastructure. They will conduct a comprehensive examination of your SQL Server environment to highlight issues and provide actionable recommendations. Interested in booking a SQL Server Health Check? See here for a limited time offer.

Insight and analysis of technology and business strategy