Implementing Fuzzy search in SQL server

Posted in: Microsoft SQL Server, Technical Track

 

Fuzzy Search in SQL Server is not done very well. This post will cover how to implement Fuzzy Search capabilities using several approaches.

 

What is it?

Fuzzy Search is the process to locate records that are relevant to a search, even when the search criteria doesn’t match. Fuzzy Searches are used to:

  1. Suggest the correct spelling of a word (“Did you mean this…”).
  2. Find results related to your search term (“You might also like…”).
  3. Finding synonyms for search terms (Search for Dog and also get results for Puppies and Pets).
  4. Probably other things…

What we’re covering here relates to finding results after a search term has been misspelled. Related Results & Synonyms are handled quite a bit differently, and are more like geographic points on a map used to find the closest other points.

 

What Does it Fix?

As suggested above, Fuzzy Search logic is great for when you or the users don’t know the exact term they’re looking for.

For example, let’s say I want to find the latest James Bond movie, but only vaguely know its name.

An IMDB search for “Specter” brings up the James Bond “Spectre” movie as a suggestion today. If you actually search for the term, you’ll get a wide variety of increasingly weird results (And also the right answer!). I don’t have access to the IMDB code, but I believe they’re using a combination of the Levenshtein Distance, Trigrams, and maybe Double Metaphones. Coincidentally, those are exactly what we’re covering here.

Search on IMDB using Spector

An example search on IMDB

 

Fuzzy Search in SQL Server

Before implementing Fuzzy Search in SQL Server, I’m going to define what each function does. As you’ll see, they are all complementary to each other and can be used together to return a wide range of results that would be missed with traditional queries or even just one of these functions.

The three functions we’re going to look at are:

  1. Levenshtein Distance
  2. Trigrams
  3. Double Metaphones

All of these have been written in T-SQL by very good DBAs. In future posts, I’ll be comparing the T-SQL performance to CLRs written in C# code to hopefully move the string manipulation and comparisons to a more appropriate place.

Levenshtein Distance

The Levenshtein Distance is a calculation of how different two strings are, and it’s expressed as the number of steps required to make StringA look like StringB. The steps are counted in terms of Inserts, Updates, and Deletes of individual letters in the two words being compared.

This is good for short strings where not many differences are expected, AKA misspelled words.

A simple example is the word Car to Date. The Levenshtein Distance here is 3, and we get there by:

Step 1: UPDATE ‘C’ to ‘D’
Step 2: UPDATE ‘r’ to ‘t’
Step 3: APPEND ‘e’

So, in the IMDB example, my search for “Specter” has a Levenshtein Distance of 2 from “Spectre”. Again, we get there by:

STEP 1: UPDATE ‘r’ to ‘e’
STEP 2: UPDATE ‘e’ to ‘r’

There is also the Damerau-Levenshtein Distance. This is built on the Levenshtein Distance, but accounts for transpositions of letters right next to each other. That would have returned a value of 1 with this logic:
STEP 1: Flip ‘r’ and ‘e’

Trigrams

Trigrams are used to find matching sets of words or characters in words or phrases. As the name implies, each Trigram is a set of 3 characters or words, and you simply count how many trigrams in each string match the other string’s trigrams to get a number.

These are great for comparing phrases.

An easy example is to compare a search for “SQL Server” to “SQL Sever”:

STEP 1: Break ‘SQL Server’ up into trigrams
‘SQL’, ‘QL ‘, ‘L S’, ‘ Se’, ‘Ser’, ‘erv’, ‘rve’, ‘ver’
STEP 2: Break ‘SQL Sever’ up into trigrams
‘SQL’, ‘QL ‘, ‘L S’, ‘ Se’, ‘Sev’, ‘eve’, ‘ver’
STEP 3: Count the number of trigrams that match: 5

A matching set of 5 trigrams might mean these are close enough for the application to suggest as an alternative.

Another example is comparing the phrase “Oracle” to “Relational Database Management System”.

STEP 1: Break ‘Oracle’ up into trigrams
‘Ora’, ‘rac’, ‘acl’, ‘cle’
STEP 2: Break ‘Relational Database Management System’ into trigrams
‘Rel’, ‘ela’, ‘lat’, ‘ati’, ‘ona’, ‘nal’, ‘al_’, ‘l_D’, …, ‘tem’
STEP 3: Count the number of trigrams that match between them: 0

As you can see, Oracle isn’t very close to being an RDBMS at all.

Finally, in our IMDB example, you can see that the movie Unexpected was returned. Why? I don’t actually know, but I believe it’s because there are several matching trigrams between “Specter” and “Unexpected” which pushed it into the possible suggestions.

Search on IMDB using Spector

I don’t like scrolling up.

Double Metaphone

Double Metaphones are the updated version of the SOUNDEX() function. Updated as in soundex was first patented in 1918, and double metaphones are from the 1990’s. They both work the same by breaking up consonants into what they sound like and comparing them to the closest matching values; however, soundex assumes each consonant has the same pronunciation, while metaphones allow for multiple pronunciations of the same word.

Double Metaphones are geared towards names

The first example is one where SOUNDEX and a Double Metaphone work identically. The name “Bruce” and it’s common(?) misspelling “Broos”.

STEP 1: SELECT SOUNDEX(‘Broos’) and we get B620
STEP 2: SELECT SOUNDEX(‘Bruce’) and we get B620
STEP 3: Use an online calculator to get the Metaphone values for Broos and Bruce
Both return ‘PRS’ and ‘PRS’

The benefit of the Double Metaphone here is that, because the results for both words are the same, you can have a higher level of confidence that this is a misspelling.

Another example is “Smith” and “Schmidt”.

STEP 1: SELECT SOUNDEX(‘smith’) and we get S530
STEP 2: SELECT SOUNDEX(‘schmidt’) and we get S530
STEP 3: Use an online calculator to get the Metaphone values for Smith and Schmidt
The most common pronunciation is first:
Smith yields ‘SM0’ and ‘XMT’
Schmidt yields ‘XMT’ and ‘SMT’

Using the SOUNDEX() function, you might return this “Smith” as a misspelling of “Schmidt”, which is unlikely. On the other hand, using a double metaphone function, you would know that while these two words are occasionally pronounced identically they more frequently not pronounced the same, and you could either discard the result or push it to the bottom of your suggestions.

If you do pronounce any of these words the same, here’s a website to compare the expected English pronunciations.

In my next post, I’ll implement each of these as CLRs in C# code. I’ll also point out some nice T-SQL implementations I’ve found and compare performance.

 

Discover more about our expertise in SQL Server

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

4 Comments. Leave new

Very glad to see you website, I met a problem about the fuzzy match in sql server, I would be grateful if you can give me some suggestions, thanks in advance.
I have a table with many columns, some columns have similar names, but record different data, for example, select * from table1, which will list all the columns. But I don’t want all the columns, I want to choose some similar columns, like contain keyword ‘date’, I used select * from table where * like %date%, I know this does not work, but it works, then that is what I want. then how to output the columns using fuzzy match for column’s names? thanks again.

Reply

You can search in systables and syscolumns if you are searching for tables metadata.

Reply

Hi,

This is a very interesting post!
I would really like to implement fuzzy matching with Trigrams.

I’m not a n00b in T-SQL, but I just can’t find out how to implement this example?

*****************************************************
An easy example is to compare a search for “SQL Server” to “SQL Sever”:

STEP 1: Break ‘SQL Server’ up into trigrams
‘SQL’, ‘QL ‘, ‘L S’, ‘ Se’, ‘Ser’, ‘erv’, ‘rve’, ‘ver’
STEP 2: Break ‘SQL Sever’ up into trigrams
‘SQL’, ‘QL ‘, ‘L S’, ‘ Se’, ‘Sev’, ‘eve’, ‘ver’
STEP 3: Count the number of trigrams that match: 5
************************************************

Is it possible to make a simple code example?

Reply

Could anyone suggest a method to  do duplicate employee by name analysis, including fuzzy name searching in SQL Server.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *