Discussion:
GIN, pg_trgm and large table
Max Fomichev
2018-03-10 10:31:46 UTC
Hello!
I have the following table and index with about 15 billion records.

CREATE TABLE us_en.ngrams
(
    ngram text COLLATE pg_catalog."default" NOT NULL,
    year smallint NOT NULL,
    occurrence bigint NOT NULL,
    words smallint NOT NULL
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

CREATE INDEX trgm_idx_ngram_ngrams_us_en
    ON us_en.ngrams USING gin
    (ngram COLLATE pg_catalog."default" gin_trgm_ops)
    TABLESPACE pg_default;
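
(For reference, the gin_trgm_ops operator class comes from the pg_trgm contrib extension, which has to be enabled in the database before this index can be built:)

```sql
-- pg_trgm ships as a contrib module and must be enabled per database
-- before gin_trgm_ops is available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```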

Configuration:
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
128GB RAM, 2TB SSD
Server settings:
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 4GB
maintenance_work_mem = 16GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_worker_processes = 16
gin_fuzzy_search_limit = 10000


My queries against the us_en.ngrams.ngram field are too slow:
ngrams=# set enable_seqscan=false;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE
ngram LIKE '%computer%' LIMIT 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=232884.91..232933.12 rows=20 width=36) (actual
time=44962.926..44966.214 rows=14 loops=1)
   Buffers: shared hit=18177345 read=124224
   ->  Bitmap Heap Scan on ngrams  (cost=232884.91..61129746.57
rows=25261021 width=36) (actual time=44962.925..44966.202 rows=14 loops=1)
         Recheck Cond: (ngram ~~ '%computer%'::text)
         Rows Removed by Index Recheck: 10
         Heap Blocks: exact=8
         Buffers: shared hit=18177345 read=124224
         ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en
(cost=0.00..226569.66 rows=25261021 width=0) (actual
time=44961.929..44961.929 rows=24 loops=1)
               Index Cond: (ngram ~~ '%computer%'::text)
               Buffers: shared hit=18177345 read=124216
 Planning time: 0.305 ms
 Execution time: 44966.271 ms
(12 rows)
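
For context on why the bitmap index scan above touches so many buffers: pg_trgm matches LIKE '%computer%' through the trigrams of 'computer', and the posting list of every one of those trigrams has to be scanned. The extracted trigrams can be inspected with pg_trgm's show_trgm() function:

```sql
-- show_trgm() returns the (padded) trigrams pg_trgm extracts from a
-- string; LIKE '%computer%' is matched via the trigrams of 'computer'.
SELECT show_trgm('computer');
```

On a 15-billion-row table, even a moderately common trigram's posting list can span millions of index pages.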


ngrams=# set enable_seqscan=true;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE
ngram LIKE '%computer%' LIMIT 20;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..215.41 rows=20 width=36) (actual time=0.248..29.089
rows=20 loops=1)
   Buffers: shared hit=14 read=1509
   ->  Seq Scan on ngrams  (cost=0.00..272072628.00 rows=25261021
width=36) (actual time=0.247..29.074 rows=20 loops=1)
         Filter: (ngram ~~ '%computer%'::text)
         Rows Removed by Filter: 207598
         Buffers: shared hit=14 read=1509
 Planning time: 0.332 ms
 Execution time: 29.117 ms
(8 rows)


ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE
ngram LIKE '%version%' ORDER BY (occurrence, year) DESC LIMIT 20;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4661042.23..4661042.28 rows=20 width=36) (actual
time=144417.365..144417.365 rows=0 loops=1)
   Buffers: shared hit=28982531 read=167634
   ->  Sort  (cost=4661042.23..4664172.58 rows=1252138 width=36)
(actual time=144417.363..144417.363 rows=0 loops=1)
         Sort Key: (ROW(occurrence, year)) DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=28982531 read=167634
         ->  Bitmap Heap Scan on ngrams (cost=20372.07..4627723.29
rows=1252138 width=36) (actual time=144417.331..144417.331 rows=0 loops=1)
               Recheck Cond: (ngram ~~ '%version%'::text)
               Buffers: shared hit=28982527 read=167633
               ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en 
(cost=0.00..20059.04 rows=1252138 width=0) (actual
time=144417.328..144417.328 rows=0 loops=1)
                     Index Cond: (ngram ~~ '%version%'::text)
                     Buffers: shared hit=28982527 read=167633
 Planning time: 0.344 ms
 Execution time: 144417.522 ms
(14 rows)
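
A side note on the query above: ORDER BY (occurrence, year) DESC sorts by a single composite value, ROW(occurrence, year), as the plan's Sort Key line shows. The resulting order happens to be the same here, but the conventional per-column spelling is what the planner can match to a multicolumn index:

```sql
-- Per-column DESC sort; equivalent ordering to the row-constructor
-- form, but written in the form the planner can use an index for.
SELECT * FROM us_en.ngrams
WHERE ngram LIKE '%version%'
ORDER BY occurrence DESC, year DESC
LIMIT 20;
```

(Either way, the dominant cost in the plan is the trigram index scan itself, not the sort.)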

Is there any way to improve GIN/pg_trgm performance on such a large table?
Thank you!
--
Best regards,
Max Fomichev
Tom Lane
2018-03-10 17:00:27 UTC
Post by Max Fomichev
I have the following table and index with about 15 billion records.
...
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
...
Is there any way to improve GIN/pg_trgm performance on such a large table?
There was some work done in 9.6 to improve pg_trgm's performance when
dealing with queries involving very common trigrams. So maybe an update
to 9.6 or v10 would help you.

I have a bad feeling though that 15 billion rows is too many for an index
based on trigrams to be really useful --- there are just not enough
distinct trigrams. It's too bad we don't have a more general N-gram
indexing facility.
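
(A rough way to see the problem, as a hypothetical check with an illustrative sample size: count trigram frequencies over a sample of rows. With billions of rows drawn from a limited trigram alphabet, the most common trigrams necessarily have enormous posting lists:)

```sql
-- Hypothetical sketch: trigram frequencies over a 1M-row sample.
-- LATERAL unnest(show_trgm(...)) expands each ngram into its trigrams.
SELECT t, count(*) AS freq
FROM (SELECT ngram FROM us_en.ngrams LIMIT 1000000) AS sample,
     LATERAL unnest(show_trgm(ngram)) AS t
GROUP BY t
ORDER BY freq DESC
LIMIT 10;
```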

regards, tom lane
Max Fomichev
2018-03-10 17:44:21 UTC
Hello Tom,
thank you for your reply.
Post by Tom Lane
Post by Max Fomichev
I have the following table and index with about 15 billion records.
... PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit ... Is there any way
to improve GIN/pg_trgm performance on such a large table?
There was some work done in 9.6 to improve pg_trgm's performance when
dealing with queries involving very common trigrams. So maybe an
update to 9.6 or v10 would help you.
I'll try the 10.x version.
Post by Tom Lane
I have a bad feeling though that 15 billion rows is too many for an
index based on trigrams to be really useful --- there are just not
enough distinct trigrams. It's too bad we don't have a more general
N-gram indexing facility.
Could you please advise on the correct approach/index type for my
case?
I have about 15 billion ngram records (each ngram contains 1 to 5
words), and I'd like to find all ngrams that contain a given search word.
--
Best regards,
Max Fomichev
Tomasz Barszczewski
2018-03-10 20:12:09 UTC
Post by Max Fomichev
Hello!
I have the following table and index with about 15 billion records.
tb: Maybe try using partitioning (with CONSTRAINT ... CHECK), so that
smaller tables are searched.
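
On 9.5 that means inheritance-based partitioning (declarative partitioning only arrived in v10). A minimal sketch, with the partitioning column and names purely illustrative:

```sql
-- Illustrative: partition by the words column, so a search that knows
-- the ngram length scans a much smaller table and trigram index.
CREATE TABLE us_en.ngrams_w1 (
    CHECK (words = 1)
) INHERITS (us_en.ngrams);

CREATE INDEX trgm_idx_ngram_ngrams_us_en_w1
    ON us_en.ngrams_w1 USING gin (ngram gin_trgm_ops);
```

With constraint_exclusion enabled, queries that filter on words skip the other partitions.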


PS. Also add a primary key - e.g. pg_repack requires one.
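
Since the table has no declared key, a surrogate one could be added (column name illustrative):

```sql
-- Illustrative: add a surrogate key. Note that on pre-11 versions
-- adding a column with a default rewrites the entire table, which is
-- very costly at 15 billion rows.
ALTER TABLE us_en.ngrams ADD COLUMN id bigserial PRIMARY KEY;
```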
--
Best regards, Tomasz Barszczewski - ***@nop.com.pl