MySQL InnoDB’s full text search overview

Pythian Marketing

May 6, 2016

Tags: MySQL, Security, Technical Track, Injection

NOTE: If you want to read and play with the interactive application, please go to the shinnyapps article. It has been developed using Shiny/R in order to allow you to see the effects of the algorithms. Thanks to Valerie Parham-Thompson at Pythian and Daniel Prince at Oracle. Github repository contains the code to generate and load the data and also, the Shiny/R code.

Some initial thoughts

A couple of weeks ago one of our customers came up with a question regarding FTS over InnoDB engine. Although the question is not answered in the current article, I came up with the conclusion that FTS is sometimes misunderstood. The point of this article is to show dynamically how the search algorithms work, using non-fictional data (data sources were downloaded from Gutenberg project within an easy interface (please see at the bottom of the ShinnyApps post here) . In order to show the effects off the field sizes over the query expansion algorithm, you will see two main tables (bookContent and bookContentByLine) both containing the same books in different approaches: by line and by paragraph. You'll see the noise generated by the QUERY EXPANSION algorithm when phrases are too large. For the sake of simplicity, in this article we won't go through the FTS parsers. That is possible material for a future post.

Why I consider FTS sometimes misunderstood?

FTS is a technology that can be use for any purpose, not only simple searches. Generally, FTS engines are placed to work as a service for web or document searches, which generally require technologies like Solr, ElasticSearch or Sphinx. However, certain bussines rules require complex searches, and having such feature inside RDBMS can be a win. RDBMS aren't a good place for massive amount of FTS queries, without using any of the join capabilities that they offer, or the ACID properties. As I said above, FTS is totally acceptable in RDBMS, if you are using at least one RDBMS main feature, required by your bussines model.

Action!

To start showing the effects of the algorithms, the following example searches the word 'country' using query expansion. This means that we are not looking only the exact matches, but also the entries that appear the most when the the exact match has been found. In the SELECT clause you'll see both FTS expressions using NATURAL LANGUAGE with query expansion and BOOLEAN modes respectively. https://gist.github.com/3manuek/9ffc83e2db4fcef85e57dd3dadbe7b6e The noise generated by the query expansion is expected and described in the official documentation here. The interesting case is the following row, which has 2 exact occurrences (you can see the positions 1 and 63) and it is not the highest rank using query extension. Remember, this is expected.

 Text: "country districts. As Lucca had five gates, he divided his own country" bookid: 1232 pos: 1,63 QERank: 80 BoolRank: 14

This is even worse when using large sentences. In the example bellow you will see the same query, against the table storing by paragraph. The boolean rank shows some of the entries way above others, however the query extension locates at the top records that not necessarily has a lot of exact matches. https://gist.github.com/3manuek/d5e5f3257696fb9a9d2a6b6a12e261b5 The query expansion is useful when you intend to search which entries contain more words that appear frequently within the search term. Having large text fields increase the probability to have more words that appear among the search term. In the case of bookContent table (by paragraph table), the average field size is 443.1163 characters.

The INNODB_FT_INDEX_TABLE

There is a way to play with the contents of the FTS indexes. As you may noticed in the previous examples, I used the set global innodb_ft_aux_table = 'ftslab/bookContent'; statement, which loads the index content to memory for an easy querying. If you use RDS, the option innodb_ft_aux_table is not available as it is GLOBAL and require SUPER privileges. i.e. You can easily get the most frequent tokens: https://gist.github.com/3manuek/692a628fc70228194396f33d6d76d4d9 We can query the index contents with a simple SQL statement like the following: https://gist.github.com/3manuek/cb815326a56bd2d29a0db23e0423455a

In the example shown before the is no intention to compare ranks score as they are based in different algorithms. The idea there is to show that QUERY EXPANSION can have non desire results in some cases due to its mechanism.

Building custom stopwords

It probably isn't very useful information as most of these words appears too frequently and are modal verbs, adverbs, pronouns, determiners, etc. It could be the case that you are not interested on indexing those words. If that's the case you can add them as stopwords in your own stopwords table. Specially if you are more interested in boolean searches, loosing some part of the language expressions. We can build a custom stopwords table based on our current data: https://gist.github.com/3manuek/30b1f8483857018cc82796dd7dda44ae Let's build our stopwords table using both default and new entries and keeping the alphabetical order: https://gist.github.com/3manuek/2146012a17c76acfa0fcbfcc229fa1c7 The idea behind choosing our own stopwords is to measure how much index do we safe filtering those words that are extremely frequent and don't add a necessary meaning to the search. This topic could be covered in a separate blog post.

Going ahead on choosing stop words

The full article is amazingly interesting. In brief, it says that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on (rank-frequency distribution is an inverse relation).

Considerations and recommendations

- Use QUERY EXPANSION only if you are interested in searching relations over exact matches. Remember that the field size is crucial when using this. - FTS is not the best fit for exact string matches in single columns. You don't want to use FTS for searching emails in a single column, name and lastname fields , i.e. For those, you'll probably use other techniques as reverse searches , exact match operator (=) or hashing (CRC32 for emails or large texts smaller than 255 characters). - Keep your FTS indexes short. Do not add ALL the text columns. Parse first from your application the user search and adapt the query. - If you are using BOOLEAN MODE, you can use the rank score to filter rows. MySQL is clever enough to optimize the FTS functions to avoid double executions. You can do this using something like: match(content,title) against ("first (third)") > 1 . Generally, scores lower than 1 can be ignored when using boolean or natural mode searches. - `OPTIMIZE TABLE` does a rebuild of the table. To avoid this, set innodb_optimize_fulltext_only=1 in order to do an incremental maintance on the table. - Recall that NATURAL LANGUAGE MODE does not take the operands as the BOOLEAN MODE. This affects the ranking score (try +bad (thing) i.e.) - If you plan to order by rank, it is not necessary to specify the clause `ORDER BY` as InnoDB does the order after retrieve the doc ids . Also,the behavior is different from the default as it returns the heaviest at the top (like an ORDER BY rank DESC). - If you come from MyISAM's FTS implementation, recall that the ranking scoring is different. - Create the FULLTEXT index after the data is loaded InnoDB Bulk Load. When restoring FTS backups, you will probably hit the "ERROR 182 (HY000) at line nn: Invalid InnoDB FTS Doc ID". - Try to avoid using use more than one FTS expression in the where clause. Keep in mind that this affects the order in the results and it consumes a considerably amount of CPU. InnoDB orders by the latest expression in the WHERE clause. WL#7123. - Also, if avoiding the rank information in the projection (SELECT clause) and using other aggregations like count(*), will use the "no ranking" FT_hints. The LIMIT hint won't be used if invoked explicitly an ORDER BY and the MATCH clause in the projection. https://gist.github.com/3manuek/2fd59b9a3a187784e02f82da520d775a - If you plan to use FTS_DOC_ID column with AUTO_INCREMENT option, have in mind that there is a limitation regarding this. You must declare a single column PRIMARY KEY constraint or as an UNIQUE index. Also, the data type is stricted as `bigint unsigned`. i.e: https://gist.github.com/3manuek/df441753ea7666c3df7d13babc566c52

FT_QUERY_EXPANSION_LIMIT

This variable controls the number of top matches when using `WITH QUERY EXPANSION` (affects only MyISAM). Reference.

Bug 80347 - Invalid InnoDB FTS Doc ID

 emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < ddl.dump emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < onlyData.dump emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID mysqldump is not very clever if you use `FTS_DOC_ID`:

 2016-02-13T22:11:53.125300Z 19 [ERROR] InnoDB: Doc ID 10002 is too big. Its difference with largest used Doc ID 1 cannot exceed or equal to 10000

It takes dumps without considering the restriction coded in `innobase/row/row0mysql.cc`: Difference between Doc IDs are restricted within 4 bytes integer. See fts_get_encoded_len() The fix to this is backuping the table by chunks of 10000 documents.

Insight and analysis of technology and business strategy