Over the last few weeks I’ve been looking at several customers’ slow query logs, and in many of them I found an odd type of query: SELECT statements containing an IN clause with dozens, sometimes hundreds, of values. These statements often end up in the slow query log. I’m not sure whether these queries are written this way by design or are generated by a specific database development tool.
I ran some tests on one of my own databases, one with only around 10K rows in its largest table. The database belongs to the Amarok media player. For example, I queried for songs by B. B. King, whose name appears in my collection under several spellings (“BB King”, “B.B. King”, etc.) and in combination with other artists (“B. B. King & Eric Clapton”).
The first query used a JOIN and an IN clause listing all the spellings in my db; the second used the same JOIN with WHERE ... name LIKE "BB%" OR name LIKE "B.%" instead. Both had the same execution plan, and both retrieved the same number of results. MySQL 4.1 introduced some optimizer enhancements for handling these massive IN clauses, so for smaller databases this behavior is expected.
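For reference, the two queries looked roughly like this. The table and column names are my approximation of the Amarok schema, so treat them as illustrative:

    -- Query 1: IN clause listing every spelling found in the db
    SELECT t.title
    FROM tags t
    JOIN artist a ON a.id = t.artist
    WHERE a.name IN ('BB King', 'B.B. King', 'B. B. King & Eric Clapton');

    -- Query 2: same JOIN, LIKE patterns instead of the IN list
    SELECT t.title
    FROM tags t
    JOIN artist a ON a.id = t.artist
    WHERE a.name LIKE 'BB%' OR a.name LIKE 'B.%';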
With bigger databases and more complex queries, things are different. For example, on one customer’s server one of these IN clauses holds more than 100 values, and for it MySQL creates a temporary table and uses filesort. It should be easy to transform the IN list into an INSERT into a TEMPORARY MEMORY table, and then use a JOIN against that table in place of the IN clause in the original query. This way we avoid the filesort altogether, which should result in a better query execution time. Unfortunately it’s not practical to post a query that contains so many values.
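A minimal sketch of this rewrite, using hypothetical table and column names, could look like the following:

    -- Hypothetical schema: orders(customer_id, ...); the id list is shortened here.
    CREATE TEMPORARY TABLE tmp_ids (
        id INT NOT NULL PRIMARY KEY
    ) ENGINE=MEMORY;

    INSERT INTO tmp_ids (id) VALUES (101), (205), (317); -- ...and so on

    -- Instead of: SELECT ... FROM orders WHERE customer_id IN (101, 205, 317, ...)
    SELECT o.*
    FROM orders o
    JOIN tmp_ids t ON t.id = o.customer_id;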
What’s your experience with this kind of expression? I’d love to learn where these gigantic IN clauses come from and hear some use cases.
3 Comments
My experience is that when IN () clauses go above 50ish values, performance drops quite a bit. So I’d go with your approach of a temporary table and a join to it instead.
Gerry,
I built a small custom C TCP app for a client. Its purpose is to rifle through content in a MySQL database and “index” that content later for search purposes. Essentially you create a socket connection to the app and pass the word/phrase you want to search, and based on a score it returns the IDs (the primary keys of the records) of the matching elements in DESC order (again, by score, not ID).
We then take those and use not only IN(), but also ORDER BY FIELD(), to retain the order given back by the “search engine”. When this originally started we had only ~14k records to pull from, and the entire process at that point took on average < 0.1 seconds (specifically 0.002s for the actual query). Originally the result sets were only 5-10 records, and my EXPLAIN shows “Using where; Using index; Using filesort”.
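To illustrate the pattern (the table name and IDs here are invented), the query is of this shape:

    -- FIELD(id, 42, 7, 19) returns id's position in the list (1, 2, or 3),
    -- so the rows come back in the order the search app scored them.
    SELECT id
    FROM content
    WHERE id IN (42, 7, 19)
    ORDER BY FIELD(id, 42, 7, 19);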
A bold decision to make the database almost 15x larger came down, and my fear was exactly what you explained here: that some of these searches could return a hundred or so results and everything would come to a screeching halt. We moved forward, and after testing found that most of our result sets weren’t growing much past ~25 records or so. The total process takes just about the same time, and specifically the query now takes ~0.004 seconds to execute.
I did some more testing and found that up to ~50 results I received pretty much the same execution time, but once I reached the ~100 record plateau the time jumped to ~0.008 seconds (essentially doubling). I don’t do any JOINs in my query, and only use primary key integers in both the IN() and the FIELD() lists. Since it’s not a lot of text compared to full words like ‘BB King’, and the table only has 2 indexes on it, maybe this is how I get away with it.
Also, for comparative measures, the machine I am running it on is dedicated to MySQL, and currently just to this search query. Its specs are as follows:
– MySQL 5.1.26-rc-log (built from source, with a couple of tuned params in the my.cnf, but not many compared to the mysql-huge example supplied in the source)
– MyISAM table
– Linux 2.6.26 kernel (on a system essentially built from source, similar to an LFS system)
– 2x AMD Dual-Core Opteron processors
– 8GB Memory
– 4x 36.7GB SCSI HDs in RAID 10
This is an interesting comment. Integers are pretty small, and even if a filesort is needed, chances are the file will get cached by the OS, avoiding a big overhead. Using strings in the same situation, especially with multi-byte charsets, the data would be significantly bigger, so it would create additional disk IO.
I guess that when you’re using columns that aren’t that big (like in your example), it won’t matter that much, as opposed to using bigger columns.
Thank you for your posting,
Gerry