Postgres partial indexes on email address domains

Posted in: PostgreSQL, Technical Track

When I first heard of Postgres partial indexes, I knew immediately that this would have solved a problem I had in MySQL about a decade ago. That problem didn’t go unsolved, but it certainly wasn’t as easy as I’ll demonstrate below.

The situation: billions and billions of email addresses. You need to find a small handful of email addresses from a particular domain.

Populate data

The test data has mostly email addresses from a large known domain. Our test data is only in the low millions, but it will give us enough to work with. The percentage column was created with random() so we can use it for domain population. Names were imported from sample data at https://github.com/solvenium/names-dataset.

update users set email=concat(name, '@gmail.com') where percentage>.20;
UPDATE 3060977

And some from another known domain.

update users set email=concat(name, '@yahoo.com') where email not like '%gmail.com';
UPDATE 763903

Finally, a small handful of email addresses from our target domain.

update users set email=concat(name, '@aol.com') where percentage<.0005;
UPDATE 1841

No index

Trust me, this takes a lot longer when row counts are in the billions.

postgres=# explain analyze verbose select * from users where email like '%aol.com';

QUERY PLAN                                                           
----------------------------------------------------
Gather  (cost=1000.00..53368.45 rows=382 width=35) (actual time=2.058..453.503 rows=1841 loops=1)
   Output: name, email, percentage
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.users  (cost=0.00..52330.25 rows=159 width=35) (actual time=0.478..339.496 rows=614 loops=3)
         Output: name, email, percentage
         Filter: ((users.email)::text ~~ '%aol.com'::text)
         Rows Removed by Filter: 1274346
         Worker 0: actual time=0.151..279.486 rows=349 loops=1
         Worker 1: actual time=0.040..289.677 rows=737 loops=1
 Planning Time: 0.301 ms
 Execution Time: 453.742 ms
(12 rows)

Create index

This index isn’t going to help due to the left-most wildcard, but let’s prove it.

create index idx_user_email on users(email);
CREATE INDEX

postgres=# explain analyze verbose select * from users where email like '%aol.com';

QUERY PLAN                                                           
-------------------------------------------------------------
 Gather  (cost=1000.00..53368.45 rows=382 width=35) (actual time=2.775..455.128 rows=1841 loops=1)
   Output: name, email, percentage
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.users  (cost=0.00..52330.25 rows=159 width=35) (actual time=0.714..343.353 rows=614 loops=3)
         Output: name, email, percentage
         Filter: ((users.email)::text ~~ '%aol.com'::text)
         Rows Removed by Filter: 1274346
         Worker 0: actual time=0.140..279.955 rows=311 loops=1
         Worker 1: actual time=0.123..299.223 rows=823 loops=1
 Planning Time: 0.306 ms
 Execution Time: 455.387 ms
(12 rows)

Create like index

This was my first idea for a partial index. Partial indexes, briefly, let you leverage a “where” clause in the index itself. The results were impressive.

create index idx_user_domain on users(email) where email like '%aol.com';
CREATE INDEX

explain analyze verbose select * from users where email like '%aol.com';

QUERY PLAN                                                          
----------------------------------------------------------------
 Bitmap Heap Scan on public.users  (cost=14.28..1415.62 rows=382 width=35) (actual time=1.917..14.224 rows=1841 loops=1)
   Output: name, email, percentage
   Recheck Cond: ((users.email)::text ~~ '%aol.com'::text)
   Heap Blocks: exact=1602
   ->  Bitmap Index Scan on idx_user_domain  (cost=0.00..14.19 rows=382 width=0) (actual time=0.988..0.989 rows=1841 loops=1)
 Planning Time: 1.849 ms
 Execution Time: 14.331 ms
(7 rows)

Reverse

One of the problems in searching for domains in email addresses has to do with the variable length of strings before the “@” symbol. This is why the wildcard is needed. A common trick is to reverse the email string to force a more standard length. Without rewriting the reversed string and indexing it, you’re still looking at a sequential scan.

postgres=# explain analyze verbose select email from users where left(reverse(email::text),8)='moc.loa';

QUERY PLAN                                                            
-------------------------------------------------------------
 Gather  (cost=1000.00..63211.15 rows=19124 width=17) (actual time=593.064..596.289 rows=0 loops=1)
   Output: email
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on public.users  (cost=0.00..60298.75 rows=7968 width=17) (actual time=547.845..547.846 rows=0 loops=3)
         Output: email
         Filter: ("left"(reverse((users.email)::text), 8) = 'moc.loa'::text)
         Rows Removed by Filter: 1274960
         Worker 0: actual time=526.251..526.251 rows=0 loops=1
         Worker 1: actual time=526.283..526.283 rows=0 loops=1
 Planning Time: 0.548 ms
 Execution Time: 596.420 ms
(12 rows)

What if we combined a partial index with this idea?

create index idx_user_reverse on users(email) where left(reverse(email::text),8)='moc.loa';

explain analyze verbose select email from users where left(reverse(email::text),8)='moc.loa';

QUERY PLAN                                                          
------------------------------------------------------------
 Bitmap Heap Scan on public.users  (cost=8.91..29508.71 rows=19124 width=17) (actual time=0.004..0.005 rows=0 loops=1)
   Output: email
   Recheck Cond: ("left"(reverse((users.email)::text), 8) = 'moc.loa'::text)
   ->  Bitmap Index Scan on idx_user_reverse  (cost=0.00..4.13 rows=19124 width=0) (actual time=0.002..0.002 rows=0 loops=1)
 Planning Time: 0.534 ms
 Execution Time: 0.025 ms
(6 rows)

Oh, yes. Very nice.

email

Interested in working with Valerie? Schedule a tech call.

About the Author

Internal Principal Consultant
With experience as an open-source DBA and developer for software-as-a-service environments, Valerie has expertise in web-scale data storage and data delivery, including MySQL, Cassandra, Postgres, and MongoDB.

4 Comments. Leave new

Hi Valerie

Nice article!

In PostgreSQL there is a more efficient index type you can use for your scenario: GIN indexes

They are available from the pg_trgm extension that is part of the PostgreSQL standard build.

Just do the following if you want to use case insensitive:

create extension pg_trgm;
create index idx_user_domain2 on users using gin(email gin_trgm_ops);

That’s it!

Know you can query anything with a % at the begin, in the middle or at the end:

explain analyze verbose select email from users where lower(email) like ‘%@aol.com’;
explain analyze verbose select email from users where lower(email) like ‘%@aol.%’;
explain analyze verbose select email from users where lower(email) like ‘peter.%@aol.com’;
explain analyze verbose select email from users where lower(email) like ‘peter.%@aol.%’;

Reply
Valerie Parham-Thompson
August 21, 2020 8:20 am

Thank you for this! I will give these a try for a next blog post.

Reply
Patrick TJ McPhee
August 21, 2020 1:01 am

How about

create index idx_user_domain2 on users(split_part(email,’@’,2));
select email from users where split_part(email,’@’,2) = ‘aol.com’;

Reply
Valerie Parham-Thompson
August 21, 2020 8:18 am

Nice option using string functions! Thank you for sharing this.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *