Improving fuzzy searches in postgresql with pg_trgm

I have been researching postgresql search capabilities for quite a while. Besides the very powerful Full Text Search capabilities the engine offers today, there are a few less known index types that support string searches in the form :
where foo like '%bar%'

For these cases, a regular b-tree index on ‘foo’ is guaranteed to never be used.
The following example illustrates a case in which a relatively simple query is transformed to take advantage of trigram indexes. In particular, to use the GIN implementation as described in the pg_trgm documentation.

The initial query was :


select distinct a.id as id, a.refcode as rc
from A a
inner join C c on a.id=c.a_id
inner join CP cp on c.id=cp.c_id
inner join AP ap on cp.ap_id=ap.id
inner join R r on a.id=r.a_id
inner join CT ct on r.ct_id=ct.id
left outer join P p on ct.id=p.id
left outer join CC cc on ct.id=cc.id
inner join CM cm on ct.id=cm.ct_id
where (a.expdate is null or a.expdate>'2017-08-15 11:29:51.023')
and (a.refcode like '%679871076%'
or cm.value like '%679871076%'
or ct.code like '%679871076%'
or ap.code like '%679871076%')
and (cp.expdate is null or cp.expdate>'2017-08-15 11:29:51.023')
and a.id in (select a_id from T where tag in ('TAG_FR'))
order by a.refcode desc
limit 20;

In our test environment, this query takes around 40 seconds, which is terribly slow for current standards. It is also very inefficient. Its execution plan is the following :


 Limit  (cost=1326434.10..1326434.15 rows=20 width=58) (actual time=39699.195..39699.199 rows=1 loops=1)
   Buffers: shared hit=9514 read=153980, temp read=184284 written=183832
   ->  Sort  (cost=1326434.10..1326438.87 rows=1910 width=58) (actual time=39699.192..39699.193 rows=1 loops=1)
         Sort Key: a.refcode
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=9514 read=153980, temp read=184284 written=183832
         ->  HashAggregate  (cost=1326364.18..1326383.28 rows=1910 width=58) (actual time=39699.106..39699.109 rows=1 loops=1)
               Group Key: a.refcode, a.id
               Buffers: shared hit=9509 read=153980, temp read=184284 written=183832
               ->  Nested Loop Semi Join  (cost=475791.89..1326354.63 rows=1910 width=58) (actual time=33031.242..39699.085 rows=2 loops=1)
                     Join Filter: ((a.id)::text = (T.a_id)::text)
                     Buffers: shared hit=9509 read=153980, temp read=184284 written=183832
                     ->  Hash Join  (cost=475791.47..1325449.20 rows=1910 width=132) (actual time=33031.070..39698.836 rows=2 loops=1)
                           Hash Cond: ((r.ct_id)::text = (cm.ct_id)::text)
                           Join Filter: ((a.refcode::text) ~~ '%679871076%'::text) OR ((cm.value)::text ~~ '%679871076%'::text) OR ((ct.code)::text ~~ '%679871076%'::text) OR ((ap.code)::text ~~ '%679871076%'::text))
                           Rows Removed by Join Filter: 6326633
                           Buffers: shared hit=9502 read=153980, temp read=184284 written=183832
                           ->  Hash Join  (cost=380591.37..517444.23 rows=1914269 width=228) (actual time=15533.152..22496.971 rows=2754668 loops=1)
                                 Hash Cond: ((r.a_id)::text = (a.id)::text)
                                 Buffers: shared hit=6923 read=116587, temp read=87168 written=86842
                                 ->  Hash Join  (cost=40965.35..103244.23 rows=685096 width=115) (actual time=1576.461..4277.153 rows=685100 loops=1)
                                       Hash Cond: ((r.ct_id)::text = (ct.id)::text)
                                       Buffers: shared hit=3830 read=31202, temp read=12830 written=12800
                                       ->  Seq Scan on role r  (cost=0.00..24543.96 rows=685096 width=74) (actual time=0.033..689.449 rows=685100 loops=1)
                                             Buffers: shared hit=1552 read=16141
                                       ->  Hash  (cost=24889.82..24889.82 rows=755082 width=41) (actual time=1575.934..1575.934 rows=755084 loops=1)
                                             Buckets: 8192  Batches: 16  Memory Usage: 3474kB
                                             Buffers: shared hit=2278 read=15061, temp written=5407
                                             ->  Seq Scan on CC cc  (cost=0.00..24889.82 rows=755082 width=41) (actual time=0.024..770.090 rows=755084 loops=1)
                                                   Buffers: shared hit=2278 read=15061
                                 ->  Hash  (cost=310901.78..310901.78 rows=954980 width=113) (actual time=13953.510..13953.510 rows=955680 loops=1)
                                       Buckets: 4096  Batches: 64  Memory Usage: 2188kB
                                       Buffers: shared hit=3093 read=85385, temp read=47936 written=62665
                                       ->  Hash Join  (cost=155110.54..310901.78 rows=954980 width=113) (actual time=6302.524..12688.915 rows=955680 loops=1)
                                             Hash Cond: ((cp.c_id)::text = (c.id)::text)
                                             Buffers: shared hit=3093 read=85385, temp read=47936 written=47766
                                             ->  Hash Join  (cost=92060.38..202625.15 rows=963766 width=54) (actual time=3661.966..7338.522 rows=955680 loops=1)
                                                   Hash Cond: ((cp.ap_id)::text = (ap.id)::text)
                                                   Buffers: shared hit=624 read=72534, temp read=27295 written=27169
                                                   ->  Seq Scan on CP cp  (cost=0.00..52769.57 rows=963766 width=73) (actual time=0.048..1128.680 rows=955680 loops=1)
                                                         Filter: ((expdate IS NULL) OR (expdate > '2017-08-15 11:29:51.023'::timestamp without time zone))
                                                         Rows Removed by Filter: 100295
                                                         Buffers: shared hit=590 read=38862
                                                   ->  Hash  (cost=51791.50..51791.50 rows=1808550 width=55) (actual time=3661.086..3661.086 rows=1808513 loops=1)
                                                         Buckets: 8192  Batches: 64  Memory Usage: 2438kB
                                                         Buffers: shared hit=34 read=33672, temp written=16278
                                                         ->  Seq Scan on AP ap  (cost=0.00..51791.50 rows=1808550 width=55) (actual time=0.023..1680.872 rows=1808513 loops=1)
                                                               Buffers: shared hit=34 read=33672
                                             ->  Hash  (cost=51218.07..51218.07 rows=369368 width=131) (actual time=2639.382..2639.382 rows=359322 loops=1)
                                                   Buckets: 4096  Batches: 16  Memory Usage: 3636kB
                                                   Buffers: shared hit=2469 read=12851, temp read=6486 written=12716
                                                   ->  Hash Join  (cost=17664.31..51218.07 rows=369368 width=131) (actual time=708.003..2151.838 rows=359322 loops=1)
                                                         Hash Cond: ((c.a_id)::text = (a.id)::text)
                                                         Buffers: shared hit=2469 read=12851, temp read=6486 written=6472
                                                         ->  Seq Scan on C c  (cost=0.00..13555.44 rows=370244 width=73) (actual time=0.031..380.244 rows=360029 loops=1)
                                                               Buffers: shared hit=476 read=9377
                                                         ->  Hash  (cost=9739.21..9739.21 rows=340968 width=58) (actual time=707.736..707.736 rows=341074 loops=1)
                                                               Buckets: 8192  Batches: 8  Memory Usage: 3776kB
                                                               Buffers: shared hit=1993 read=3474, temp written=2841
                                                               ->  Seq Scan on A a  (cost=0.00..9739.21 rows=340968 width=58) (actual time=0.024..335.101 rows=341074 loops=1)
                                                                     Filter: ((expdate IS NULL) OR (expdate > '2017-08-15 11:29:51.023'::timestamp without time zone))
                                                                     Rows Removed by Filter: 703
                                                                     Buffers: shared hit=1993 read=3474
                           ->  Hash  (cost=57088.49..57088.49 rows=1711649 width=53) (actual time=3569.603..3569.603 rows=1711613 loops=1)
                                 Buckets: 8192  Batches: 64  Memory Usage: 2318kB
                                 Buffers: shared hit=2579 read=37393, temp written=15362
                                 ->  Seq Scan on CM cm  (cost=0.00..57088.49 rows=1711649 width=53) (actual time=0.077..1698.498 rows=1711614 loops=1)
                                       Buffers: shared hit=2579 read=37393
                     ->  Index Only Scan using t_tagid_idx on T  (cost=0.42..0.46 rows=1 width=37) (actual time=0.097..0.097 rows=1 loops=2)
                           Index Cond: ((tag = 'TAG_FR'::text) AND (a_id = (r.a_id)::text))
                           Heap Fetches: 0
                           Buffers: shared hit=7
 Planning time: 17.321 ms
 Execution time: 39703.950 ms

Certainly there should be a way to improve it. We made 2 things :

Create GIN indexes on the search fields.
Transform the original query in order to take advantage of the indexes.

The transformed query is semantically similar to the original one (with some minor constraints removed on the expdate). It ended up looking as follows :


select distinct id, refcode from
(select  a.id as id, a.refcode as refcode
from A a
where a.id in (select a_id from A a inner join T t on a.id=t.a_id where t.tag in ('TAG_FR') and (a.expdate is null or a.expdate>'2017-08-15 11:29:51.023') and a.refcode like '%679871076%'
union select a_id from R r inner join CT ct on r.ct_id=ct.id inner join CM cm on ct.id=cm.ct_id  where cm.value like '%679871076%'
union select a_id from C c inner join CP cp on c.id=cp.c_id inner join AP ap on cp.ap_id=ap.id where ap.code like '%679871076%' and (cp.expdate is null or cp.expdate>'2017-08-15 11:29:51.023')
union select a_id from R r inner join CT ct on r.ct_id=ct.id where ct.code like '%679871076%'
)) main
order by refcode desc
limit 20;

This time, the real power of GIN indexes come into play and postgresql query performance just shines :


 Limit  (cost=30127.87..30128.02 rows=20 width=58) (actual time=13.993..13.997 rows=1 loops=1)
   Buffers: shared hit=183
   ->  Unique  (cost=30127.87..31409.53 rows=170888 width=58) (actual time=13.991..13.994 rows=1 loops=1)
         Buffers: shared hit=183
         ->  Sort  (cost=30127.87..30555.09 rows=170888 width=58) (actual time=13.988..13.989 rows=1 loops=1)
               Sort Key: a.refcode, a.id
               Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=183
               ->  Nested Loop  (cost=7234.67..8849.40 rows=170888 width=58) (actual time=13.909..13.913 rows=1 loops=1)
                     Buffers: shared hit=178
                     ->  HashAggregate  (cost=7234.25..7236.25 rows=200 width=516) (actual time=13.857..13.858 rows=1 loops=1)
                           Group Key: ("ANY_subquery".a_id)::text
                           Buffers: shared hit=174
                           ->  Subquery Scan on "ANY_subquery"  (cost=7226.11..7233.35 rows=362 width=516) (actual time=13.840..13.842 rows=1 loops=1)
                                 Buffers: shared hit=174
                                 ->  HashAggregate  (cost=7226.11..7229.73 rows=362 width=37) (actual time=13.837..13.838 rows=1 loops=1)
                                       Group Key: t.a_id
                                       Buffers: shared hit=174
                                       ->  Append  (cost=84.69..7225.20 rows=362 width=37) (actual time=5.386..13.831 rows=2 loops=1)
                                             Buffers: shared hit=174
                                             ->  Nested Loop  (cost=84.69..364.29 rows=34 width=37) (actual time=0.897..0.897 rows=0 loops=1)
                                                   Buffers: shared hit=18
                                                   ->  Bitmap Heap Scan on A a  (cost=84.27..212.82 rows=34 width=37) (actual time=0.895..0.895 rows=0 loops=1)
                                                         Recheck Cond: ((refcode::text) ~~ '%679871076%'::text)
                                                         Filter: ((expdate IS NULL) OR (expdate > '2017-08-15 11:29:51.023'::timestamp without time zone))
                                                         Buffers: shared hit=18
                                                         ->  Bitmap Index Scan on a_ref_trg_idx  (cost=0.00..84.26 rows=34 width=0) (actual time=0.892..0.892 rows=0 loops=1)
                                                               Index Cond: (refcode::text) ~~ '%679871076%'::text)
                                                               Buffers: shared hit=18
                                                   ->  Index Only Scan using tag_tagid_idx on tag t  (cost=0.42..4.45 rows=1 width=37) (never executed)
                                                         Index Cond: ((tag = 'TAG_FR'::text) AND (a_id = (a.id)::text))
                                                         Heap Fetches: 0
                                             ->  Nested Loop  (cost=90.16..1576.51 rows=181 width=37) (actual time=4.486..4.594 rows=2 loops=1)
                                                   Buffers: shared hit=66
                                                   ->  Nested Loop  (cost=89.73..1479.35 rows=169 width=72) (actual time=4.433..4.488 rows=2 loops=1)
                                                         Buffers: shared hit=58
                                                         ->  Bitmap Heap Scan on CM cm  (cost=89.31..734.45 rows=169 width=36) (actual time=4.355..4.366 rows=2 loops=1)
                                                               Recheck Cond: ((value)::text ~~ '%679871076%'::text)
                                                               Heap Blocks: exact=2
                                                               Buffers: shared hit=51
                                                               ->  Bitmap Index Scan on cm_value_trg_idx  (cost=0.00..89.26 rows=169 width=0) (actual time=4.332..4.332 rows=2 loops=1)
                                                                     Index Cond: ((value)::text ~~ '%679871076%'::text)
                                                                     Buffers: shared hit=49
                                                         ->  Index Only Scan using ct_pk on CT ct  (cost=0.42..4.40 rows=1 width=36) (actual time=0.048..0.049 rows=1 loops=2)
                                                               Index Cond: (id = (cm.ct_id)::text)
                                                               Heap Fetches: 0
                                                               Buffers: shared hit=7
                                                   ->  Index Scan using r_ct_idx on R r  (cost=0.42..0.56 rows=1 width=74) (actual time=0.047..0.048 rows=1 loops=2)
                                                         Index Cond: ((ct_id)::text = (ct.id)::text)
                                                         Buffers: shared hit=8
                                             ->  Nested Loop  (cost=70.25..4491.35 rows=95 width=37) (actual time=7.958..7.958 rows=0 loops=1)
                                                   Buffers: shared hit=69
                                                   ->  Nested Loop  (cost=69.83..4442.92 rows=96 width=36) (actual time=7.956..7.956 rows=0 loops=1)
                                                         Buffers: shared hit=69
                                                         ->  Bitmap Heap Scan on AP ap  (cost=69.40..755.87 rows=181 width=37) (actual time=7.955..7.955 rows=0 loops=1)
                                                               Recheck Cond: ((code)::text ~~ '%679871076%'::text)
                                                               Buffers: shared hit=69
                                                               ->  Bitmap Index Scan on ap_code_trg_idx  (cost=0.00..69.36 rows=181 width=0) (actual time=7.953..7.953 rows=0 loops=1)
                                                                     Index Cond: ((code)::text ~~ '%679871076%'::text)
                                                                     Buffers: shared hit=69
                                                         ->  Index Scan using cp_apid_idx on CP cp  (cost=0.43..20.34 rows=3 width=73) (never executed)
                                                               Index Cond: ((ap_id)::text = (ap.id)::text)
                                                               Filter: ((expdate IS NULL) OR (expdate > '2017-08-15 11:29:51.023'::timestamp without time zone))
                                                   ->  Index Scan using c_pk on C c  (cost=0.42..0.49 rows=1 width=73) (never executed)
                                                         Index Cond: ((id)::text = (cp.c_id)::text)
                                             ->  Nested Loop  (cost=88.87..789.43 rows=52 width=37) (actual time=0.373..0.373 rows=0 loops=1)
                                                   Buffers: shared hit=21
                                                   ->  Bitmap Heap Scan on CT contact2__1  (cost=88.45..307.50 rows=57 width=36) (actual time=0.371..0.371 rows=0 loops=1)
                                                         Recheck Cond: ((code::text) ~~ '%679871076%'::text)
                                                         Buffers: shared hit=21
                                                         ->  Bitmap Index Scan on ct_code_trg_idx  (cost=0.00..88.43 rows=57 width=0) (actual time=0.369..0.369 rows=0 loops=1)
                                                               Index Cond: ((code::text) ~~ '%679871076%'::text)
                                                               Buffers: shared hit=21
                                                   ->  Index Scan using r_ct_idx on R r  (cost=0.42..8.45 rows=1 width=74) (never executed)
                                                         Index Cond: ((ct_id)::text = (ct.id)::text)
                     ->  Index Scan using a_pk on A a  (cost=0.42..8.06 rows=1 width=58) (actual time=0.048..0.049 rows=1 loops=1)
                           Index Cond: ((id)::text = ("ANY_subquery".a_id)::text)
                           Buffers: shared hit=4
 Planning time: 10.352 ms
 Execution time: 14.596 ms

From 40 seconds to 15 miliseconds, this is quite a boost. In addition, the logical I/O is reduced to its minimal epression. The planner made full usage of the GIN indexes created on the search fields (prefixed with _trg_idx).

There are a few drawbacks, however :

GIN indexes are relatively large.
GIN indexes are slower to create and to update. Think twice before creating them on fields that are updated frequently. GIST indexes might be better in these cases, but at the cost of a lower search precision.
GIN indexes will be picked by the planner only if you use at least 3 characters in your LIKE token. Watch out for this one, if you use less than 3 characters in your search condition the planner will seqscan the table or use another index for another eventual filter in your where clause.

The time has come now to chill out with a GIN tonic.

Alexis

Alexis is the founder of Aleph Technologies, a data infrastructure consulting and professional services provider based in Brussels, Belgium.

Aleph Technologies Blog

Improving fuzzy searches in postgresql with pg_trgm

Alexis

Leave a Reply Cancel reply