Max Fomichev
2018-03-10 10:31:46 UTC
Hello!
I have the following table and index with about 15 billion records.
CREATE TABLE us_en.ngrams
(
    ngram text COLLATE pg_catalog."default" NOT NULL,
    year smallint NOT NULL,
    occurrence bigint NOT NULL,
    words smallint NOT NULL
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;
CREATE INDEX trgm_idx_ngram_ngrams_us_en
    ON us_en.ngrams USING gin
    (ngram COLLATE pg_catalog."default" gin_trgm_ops)
    TABLESPACE pg_default;
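For reference, the keys this GIN index stores are pg_trgm trigrams; a quick way to see what a search term decomposes into (assuming the pg_trgm extension is already installed, which gin_trgm_ops implies):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- show_trgm() lists the trigrams pg_trgm extracts from a string; a
-- LIKE '%computer%' search must match index entries for all of these.
SELECT show_trgm('computer');
-- {"  c"," co",com,omp,mpu,put,ute,ter,"er "}  (output illustrative)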
Configuration:
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
128GB RAM, 2TB SSD
Server settings:
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 4GB
maintenance_work_mem = 16GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_worker_processes = 16
gin_fuzzy_search_limit = 10000
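For scale, the on-disk sizes of the table and the trigram index can be checked with the standard size functions (relation names as defined above); whether the index fits in the 96GB of effective cache matters for the buffer counts below:

SELECT pg_size_pretty(pg_total_relation_size('us_en.ngrams'))               AS table_total,
       pg_size_pretty(pg_relation_size('us_en.trgm_idx_ngram_ngrams_us_en')) AS trgm_index;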
My queries against the us_en.ngrams.ngram field are too slow:
ngrams=# set enable_seqscan=false;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE ngram LIKE '%computer%' LIMIT 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=232884.91..232933.12 rows=20 width=36) (actual time=44962.926..44966.214 rows=14 loops=1)
   Buffers: shared hit=18177345 read=124224
   ->  Bitmap Heap Scan on ngrams  (cost=232884.91..61129746.57 rows=25261021 width=36) (actual time=44962.925..44966.202 rows=14 loops=1)
         Recheck Cond: (ngram ~~ '%computer%'::text)
         Rows Removed by Index Recheck: 10
         Heap Blocks: exact=8
         Buffers: shared hit=18177345 read=124224
         ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en  (cost=0.00..226569.66 rows=25261021 width=0) (actual time=44961.929..44961.929 rows=24 loops=1)
               Index Cond: (ngram ~~ '%computer%'::text)
               Buffers: shared hit=18177345 read=124216
Planning time: 0.305 ms
Execution time: 44966.271 ms
(12 rows)
ngrams=# set enable_seqscan=true;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE ngram LIKE '%computer%' LIMIT 20;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..215.41 rows=20 width=36) (actual time=0.248..29.089 rows=20 loops=1)
   Buffers: shared hit=14 read=1509
   ->  Seq Scan on ngrams  (cost=0.00..272072628.00 rows=25261021 width=36) (actual time=0.247..29.074 rows=20 loops=1)
         Filter: (ngram ~~ '%computer%'::text)
         Rows Removed by Filter: 207598
         Buffers: shared hit=14 read=1509
Planning time: 0.332 ms
Execution time: 29.117 ms
(8 rows)
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE ngram LIKE '%version%' ORDER BY (occurrence, year) DESC LIMIT 20;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4661042.23..4661042.28 rows=20 width=36) (actual time=144417.365..144417.365 rows=0 loops=1)
   Buffers: shared hit=28982531 read=167634
   ->  Sort  (cost=4661042.23..4664172.58 rows=1252138 width=36) (actual time=144417.363..144417.363 rows=0 loops=1)
         Sort Key: (ROW(occurrence, year)) DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=28982531 read=167634
         ->  Bitmap Heap Scan on ngrams  (cost=20372.07..4627723.29 rows=1252138 width=36) (actual time=144417.331..144417.331 rows=0 loops=1)
               Recheck Cond: (ngram ~~ '%version%'::text)
               Buffers: shared hit=28982527 read=167633
               ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en  (cost=0.00..20059.04 rows=1252138 width=0) (actual time=144417.328..144417.328 rows=0 loops=1)
                     Index Cond: (ngram ~~ '%version%'::text)
                     Buffers: shared hit=28982527 read=167633
Planning time: 0.344 ms
Execution time: 144417.522 ms
(14 rows)
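(As an aside, the parentheses make PostgreSQL sort by a single composite value, as the plan's "Sort Key: (ROW(occurrence, year)) DESC" shows; if the intent is to sort each column independently, the query would presumably be written as:

SELECT * FROM us_en.ngrams
WHERE ngram LIKE '%version%'
ORDER BY occurrence DESC, year DESC
LIMIT 20;
)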
Is there any way to improve GIN/pg_trgm performance on such a large table?
Thank you!
--
Best regards,
Max Fomichev