Indexes for inequalities

Discussion:

Stephen Froehlich

2018-04-13 17:09:58 UTC

Hi All,

I am creating an intersect table where I have a relationship that is true for a period of time and then a series of observations, so we're looking at something like:

SELECT * FROM

observations
INNER JOIN
relationships
ON (observations.id = relationships.id AND observations.time >= relationships.time_from AND observations.time < relationships.time_to)

How do I best build indexes on "relationships", which is a few hundred thousand lines in length for a fast join? Do I build one on all three columns (id, time_from, time_to) or three separate indexes for each column or some other combo? It's a small enough table where space isn't a worry, but speed will be as "observations" is several terabytes in size.

This is PostgreSQL 10, so we can CREATE STATISTICS, but the id, is an md5 (its hashed data), so it is therefore completely orthogonal on its own.

Thanks,
Stephen
________________________________
Stephen Froehlich
Sr. Strategist, CableLabs(r)

***@cablelabs.com<mailto:***@cablelabs.com>
Tel: +1 (303) 661-3708

Tom Lane

2018-04-13 17:24:04 UTC

Permalink

Post by Stephen Froehlich
SELECT * FROM
observations
INNER JOIN
relationships
ON (observations.id = relationships.id AND observations.time >= relationships.time_from AND observations.time < relationships.time_to)
How do I best build indexes on "relationships", which is a few hundred
thousand lines in length for a fast join?

You won't get terribly great results from standard btree indexes on that
sort of range test. If there are not too many relationships entries per
"id" then it might not matter, but if there are a lot then you need decent
index selectivity for the time aspect too. You might do better by
representing the time_from/time_to pair as a range and then using a GIST
index on the range, along the lines of

SELECT * FROM
observations
INNER JOIN
relationships
ON (observations.id = relationships.id AND
observations.time <@ tstzrange(relationships.time_from, relationships.time_to))

I think you'd need the btree_gist extension as well, so that the index
can be like

create index on relationships using gist (id, tstzrange(time_from,time_to));

You could do it just like this and leave the table storage alone, but it
might be better to materialize the range value as an actual column in
the table.

regards, tom lane

Stephen Froehlich

2018-04-13 20:12:28 UTC

Permalink

I'm averaging 1.5 ranges per ID over about 3 months.

Still, it's a good opportunity to learn to do it right.

Follow-up questions:
- what data type do you use to go from RPostgreSQL to a tztrange using dbWriteTable? (Or do I need to make it into a text field in R first ... which is doable.)
- how do you build a constraint that there are no overlapping ranges for a given ID?

Thanks again,
Stephen

-----Original Message-----
From: Tom Lane <***@sss.pgh.pa.us>
Sent: Friday, April 13, 2018 11:24 AM
To: Stephen Froehlich <***@cablelabs.com>
Cc: pgsql-novice <pgsql-***@postgresql.org>
Subject: Re: Indexes for inequalities

Post by Stephen Froehlich
SELECT * FROM
observations
INNER JOIN
relationships
ON (observations.id = relationships.id AND observations.time >=
relationships.time_from AND observations.time < relationships.time_to)
How do I best build indexes on "relationships", which is a few hundred
thousand lines in length for a fast join?

You won't get terribly great results from standard btree indexes on that sort of range test. If there are not too many relationships entries per "id" then it might not matter, but if there are a lot then you need decent index selectivity for the time aspect too. You might do better by representing the time_from/time_to pair as a range and then using a GIST index on the range, along the lines of

SELECT * FROM
observations
INNER JOIN
relationships
ON (observations.id = relationships.id AND observations.time <@ tstzrange(relationships.time_from, relationships.time_to))

I think you'd need the btree_gist extension as well, so that the index can be like

create index on relationships using gist (id, tstzrange(time_from,time_to));

You could do it just like this and leave the table storage alone, but it might be better to materialize the range value as an actual column in the table.

regards, tom lane

David G. Johnston

2018-04-13 21:05:59 UTC

Permalink

On Fri, Apr 13, 2018 at 1:12 PM, Stephen Froehlich <

Post by Stephen Froehlich
I'm averaging 1.5 ranges per ID over about 3 months.
Still, it's a good opportunity to learn to do it right.
- what data type do you use to go from RPostgreSQL to a tztrange using
dbWriteTable? (Or do I need to make it into a text field in R first ...
which is doable.)

âDon't know...â

- how do you build a constraint that there are no overlapping ranges for a

Post by Stephen Froehlich
given ID?

Not sure on the exact syntax you'll need here but the following example
seems to be relevant:

âhttps://www.postgresql.org/docs/10/static/btree-gist.html

CREATE TABLE zoo (
cage INTEGER,
animal TEXT,
EXCLUDE USING GIST (cage WITH =, animal WITH <>)
);

Except you'll use an overlap operation instead of inequality for the one
check.

You'll need ( think...) the extension since both operators must belong to
the same class and btree doesn't have an overlaps

David J.

Continue reading on narkive:

Search results for 'Indexes for inequalities' (Questions and Answers)

replies

Can you take a look at this graph and tell me in which of these countries you would like to live?

started 2010-09-04 03:02:29 UTC

politics

replies

are inequalities towards men fair because of inequalities towards women in the past?

started 2013-05-29 23:30:45 UTC

gender studies

replies

Hi Gia huy Ho. I have a question here. How can you apply Cauchy's inequality when you haven't proved that 1/x is non-negative?