I love finding out new things about MySQL. Last week, I stumbled on a query that had the phrase “SOUNDS LIKE” in it. Sounds made-up, right? Turns out MySQL is using a known “soundex” algorithm common to most databases, and popular in use cases in geneaology.
The basic idea is that words are encoded according to their consonants. Consonants that sound similar (like M and N) are given the same code. Here’s a simple example:
(“soundex” and “sounds like” are different ways of doing the same thing in these queries)
MariaDB> select soundex("boom"); +-----------------+ | soundex("boom") | +-----------------+ | B500 | +-----------------+ MariaDB> select soundex("bam"); +----------------+ | soundex("bam") | +----------------+ | B500 | +----------------+ MariaDB> select soundex("bin"); +----------------+ | soundex("bin") | +----------------+ | B500 | +----------------+
This simple example isn’t terribly useful, but if you were trying to find similar, but differently spelled, names across continents, it could be helpful:
MariaDB> select soundex("William"); +--------------------+ | soundex("William") | +--------------------+ | W450 | +--------------------+ MariaDB> select soundex("Walaam"); +-------------------+ | soundex("Walaam") | +-------------------+ | W450 | +-------------------+ MariaDB> select soundex("Willem"); +-------------------+ | soundex("Willem") | +-------------------+ | W450 | +-------------------+ MariaDB> select soundex("Williama"); +---------------------+ | soundex("Williama") | +---------------------+ | W450 | +---------------------+
And you could probably agree these variations match as well:
MariaDB> select soundex("Guillaume"); +----------------------+ | soundex("Guillaume") | +----------------------+ | G450 | +----------------------+ MariaDB> select soundex("Uilleam"); +--------------------+ | soundex("Uilleam") | +--------------------+ | U450 | +--------------------+ MariaDB> select soundex("Melhem"); +-------------------+ | soundex("Melhem") | +-------------------+ | M450 | +-------------------+ MariaDB> select soundex("Uilliam"); +--------------------+ | soundex("Uilliam") | +--------------------+ | U450 | +--------------------+
Well, that’s pretty neat. Of course, I want to try the silliest word I can think of:
MariaDB> select soundex("supercalifragilisticexpealidocious"); +-----------------------------------------------+ | soundex("supercalifragilisticexpealidocious") | +-----------------------------------------------+ | S162416242321432 | +-----------------------------------------------+
So the algorithm doesn’t stop at 3 digits; good to know.
What does the algorithm do? Luckily MySQL is open source, and so we can look in the source code:
This looks like the raw mapping. And then this is called into a function that loops through the characters in the word.
/* ABCDEFGHIJKLMNOPQRSTUVWXYZ */ /* :::::::::::::::::::::::::: */ const char *soundex_map= "01230120022455012623010202";
Note that even though it’s called “sounds like” it is really simply a character mapping based on an agreement by the developers’ ears which characters sounds similar. That’s of course an oversimplification, and I see in the code comments the following:
/**************************************************************** * SOUNDEX ALGORITHM in C * * * * The basic Algorithm source is taken from EDN Nov. * * 14, 1985 pg. 36. *
But despite hitting up several librarians, I can’t seem to get a copy of this. Someone out there has a copy sitting around, right?
As a side note, this is obviously specific to the English language. I found references to German and other languages having soundex mappings, and would be curious to see those and hear of any language-specific ways to do this.
Curiosity aside, here’s a real use.
I pulled down some government climate data. Let’s say the location field has some of my favorite misspellings of “Durham” in it:
MariaDB [weather]> select distinct(two), count(two) from weather.temps group by two; +--------------------------------------------+------------+ | two | count(two) | +--------------------------------------------+------------+ | NULL | 0 | | DRM | 51 | | DURHAM | 1101887 | | DURM | 71 | | NCSU | 1000000 | | RALEIGH DURHAM INTERNATIONAL AIRPORT NC US | 1096195 | | RDU AIRPORT | 1000000 | +--------------------------------------------+------------+
A “LIKE” clause won’t work terribly well here.
I confirmed the misspellings would match as I expected:
MariaDB [weather]> select soundex("Durham"), soundex("Durm"), soundex("DRM"); +-------------------+-----------------+----------------+ | soundex("Durham") | soundex("Durm") | soundex("DRM") | +-------------------+-----------------+----------------+ | D650 | D650 | D650 | +-------------------+-----------------+----------------+
So instead of manually creating a query like:
MariaDB [weather]> select count(two) from weather.temps where two='DRM' or two='DURHAM' or two='DURM'; +------------+ | count(two) | +------------+ | 1102009 | +------------+
I can simply do this:
MariaDB [weather]> select count(two) from weather.temps where two sounds like 'Durham'; +------------+ | count(two) | +------------+ | 1102009 | +------------+
There are more than several ways to do string comparisons, but I enjoyed finding this one.
(Bonus points will be granted to the first person who comments that RDU is also Durham and submits a unique query to include it in the count.)
1 Comment. Leave new
This is extremely interesting and I can certainly think of some use cases for it right off the top of my head. However, don’t you worry about rows that should be included not being included and vice versa? i.e. What if the word D-U-R-H-A-M results in a soundex of D630 it definitely should be included in your query above but obviously the odd soundex result could mean it’s not included in the grouping.
What would you do in situations like that?