NOTE: If you want to read and play with the interactive application, please go to the shinnyapps article. It has been developed using Shiny/R in order to allow you to see the effects of the algorithms.
Thanks to Valerie Parham-Thompson at Pythian and Daniel Prince at Oracle.
Github repository contains the code to generate and load the data and also, the Shiny/R code.
Some initial thoughts
A couple of weeks ago one of our customers came up with a question regarding FTS over InnoDB engine. Although the question is not answered in the current article, I came up with the conclusion that FTS is sometimes misunderstood.
The point of this article is to show dynamically how the search algorithms work, using non-fictional data (data sources were downloaded from Gutenberg project within an easy interface (please see at the bottom of the ShinnyApps post here) .
In order to show the effects off the field sizes over the query expansion algorithm, you will see two main tables (bookContent and bookContentByLine) both containing the same books in different approaches: by line and by paragraph. You’ll see the noise generated by the QUERY EXPANSION
algorithm when phrases are too large.
For the sake of simplicity, in this article we won’t go through the FTS parsers. That is possible material for a future post.
Why I consider FTS sometimes misunderstood?
FTS is a technology that can be use for any purpose, not only simple searches. Generally, FTS engines are placed to work as a service for web or document searches, which generally require technologies like Solr, ElasticSearch or Sphinx. However, certain bussines rules require complex searches, and having such feature inside RDBMS can be a win.
RDBMS aren’t a good place for massive amount of FTS queries, without using any of the join capabilities that they offer, or the ACID properties.
As I said above, FTS is totally acceptable in RDBMS, if you are using at least one RDBMS main feature, required by your bussines model.
Action!
To start showing the effects of the algorithms, the following example searches the word ‘country’ using query expansion. This means that we are not looking only the exact matches, but also the entries that appear the most when the the exact match has been found.
In the SELECT
clause you’ll see both FTS expressions using NATURAL LANGUAGE
with query expansion and BOOLEAN
modes respectively.
The noise generated by the query expansion is expected and described in the official documentation here.
The interesting case is the following row, which has 2 exact occurrences (you can see the positions 1 and 63) and it is not the highest rank using query extension. Remember, this is expected.
Text: "country districts. As Lucca had five gates, he divided his own country"
bookid: 1232
pos: 1,63
QERank: 80
BoolRank: 14
This is even worse when using large sentences. In the example bellow you will see the same query, against the table storing by paragraph. The boolean rank shows some of the entries way above others, however the query extension locates at the top records that not necessarily has a lot of exact matches.
The query expansion is useful when you intend to search which entries contain more words that appear frequently within the search term. Having large text fields increase the probability to have more words that appear among the search term. In the case of bookContent table (by paragraph table), the average field size is 443.1163 characters.
The INNODB_FT_INDEX_TABLE
There is a way to play with the contents of the FTS indexes. As you may noticed in the previous examples, I used the set global innodb_ft_aux_table = 'ftslab/bookContent';
statement, which loads the index content to memory for an easy querying.
If you use RDS, the option innodb_ft_aux_table is not available as it is GLOBAL and require SUPER privileges.
i.e. You can easily get the most frequent tokens:
We can query the index contents with a simple SQL statement like the following:
In the example shown before the is no intention to compare ranks score as they are based in different algorithms. The idea there is to show that QUERY EXPANSION can have non desire results in some cases due to its mechanism.
Building custom stopwords
It probably isn’t very useful information as most of these words appears too frequently and are modal verbs, adverbs, pronouns, determiners, etc. It could be the case that you are not interested on indexing those words. If that’s the case you can add them as stopwords
in your own stopwords table. Specially if you are more interested in boolean searches, loosing some part of the language expressions.
We can build a custom stopwords table based on our current data:
Let’s build our stopwords table using both default and new entries and keeping the alphabetical order:
The idea behind choosing our own stopwords is to measure how much index do we safe filtering those words that are extremely frequent and don’t add a necessary meaning to the search. This topic could be covered in a separate blog post.
Going ahead on choosing stop words
The full article is amazingly interesting. In brief, it says that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on (rank-frequency distribution is an inverse relation).
Considerations and recommendations
– Use QUERY EXPANSION only if you are interested in searching relations over exact matches. Remember that the field
size is crucial when using this.
– FTS is not the best fit for exact string matches in single columns. You don’t want to use FTS for searching emails in a single column, name and lastname fields , i.e. For those, you’ll probably use other techniques as reverse searches , exact match operator (=) or hashing (CRC32 for emails or large texts smaller than 255 characters).
– Keep your FTS indexes short. Do not add ALL the text columns. Parse first from your application the user search and adapt the query.
– If you are using BOOLEAN MODE, you can use the rank score to filter rows. MySQL is clever enough to optimize the
FTS functions to avoid double executions. You can do this using something like: match(content,title) against ("first (third)") > 1
. Generally, scores lower than 1 can be ignored when using boolean or natural mode searches.
– `OPTIMIZE TABLE` does a rebuild of the table. To avoid this, set innodb_optimize_fulltext_only=1 in order to do an incremental maintance on the table.
– Recall that NATURAL LANGUAGE MODE does not take the operands as the BOOLEAN MODE. This affects the ranking score (try +bad (thing) i.e.)
– If you plan to order by rank, it is not necessary to specify the clause `ORDER BY` as InnoDB does the order after retrieve the doc ids . Also,the behavior is different from the default as it returns the heaviest at the top (like an ORDER BY rank DESC).
– If you come from MyISAM’s FTS implementation, recall that the ranking scoring is different.
– Create the FULLTEXT index after the data is loaded InnoDB Bulk Load. When restoring FTS backups, you will probably hit the “ERROR 182 (HY000) at line nn: Invalid InnoDB FTS Doc ID”.
– Try to avoid using use more than one FTS expression in the where clause. Keep in mind that this affects the order in the results and it consumes a considerably amount of CPU. InnoDB orders by the latest expression in the WHERE clause. WL#7123.
– Also, if avoiding the rank information in the projection (SELECT clause) and using other aggregations like count(*), will use the “no ranking” FT_hints. The LIMIT
hint won’t be used if invoked explicitly an ORDER BY
and the MATCH
clause in the projection.
– If you plan to use FTS_DOC_ID column with AUTO_INCREMENT option, have in mind that there is a limitation regarding this. You must declare a single column PRIMARY KEY constraint or as an UNIQUE index. Also, the data type is stricted as `bigint unsigned`. i.e:
FT_QUERY_EXPANSION_LIMIT
This variable controls the number of top matches when using `WITH QUERY EXPANSION` (affects only MyISAM). Reference.
Bug 80347 – Invalid InnoDB FTS Doc ID
[email protected] ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID
[email protected] ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < ddl.dump
[email protected] ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < onlyData.dump
[email protected] ~/sandboxes/rsandbox_5_7_9 $ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID
mysqldump is not very clever if you use `FTS_DOC_ID`:
2016-02-13T22:11:53.125300Z 19 [ERROR] InnoDB: Doc ID 10002 is too big. Its difference with largest used Doc ID 1 cannot exceed or equal to 10000
It takes dumps without considering the restriction coded in `innobase/row/row0mysql.cc`:
Difference between Doc IDs are restricted within
4 bytes integer. See fts_get_encoded_len()
The fix to this is backuping the table by chunks of 10000 documents.
Other useful links
Fine tuning
Maintenance: innodb_optimize_fulltext_only
Writing FTS parser plugins
1 Comment. Leave new
In our case, we had to lower the minimum token length to 2 which was killing the database.
Unfortunately the stopword list was dependent on the table (not the server or database). So we decided to build the list in the application.
Also we noticed that Last_query_cost was always smaller if we order the keywords from the most to the least discriminative. In our case it was easy because it was the string length.
Ex: “The Motor is 345” -> “+motor* +345 +is” (the: stop word, is: reference code so it is kept, 345: reference code)