Paul Linehan
2018-04-30 14:25:53 UTC
I have a tricky piece of SQL that's bothering me.
I have a table called illness - it's oncology so these are all tumours.
Let's say we have cervical and lung and some months we might have more
lung tumours than cervical and other months it's vice versa!
We record them as follows (full DDL and DML at end of post)
nature_of_illness VARCHAR(25), created_at TIMESTAMP (UTC) - these
dates and times are essentially random - tests can be returned by
international labs at any time of the day or night.
Now, what I want is to find which illness was reported most in a given
month and in a given year.
The records returned should look like
nature_of_illness year month count (count has to be the MAX
out of all illnesses for a given month)
lung 2017 January 53 - i.e. in January
2017, the most reported illness was lung tumours of which there were
53
cervix 2017 February 45
..
&c...
I have an added constraint - this must work using "old" SQL - i.e.
with MySQL 5.6 as well as PostgreSQL 10 - no Windows/Analytics
functions &c.
I have got this far:
-- SELECT DISTINCT(i_data), il_mc) -- commented out pieces show other
failed attempts!
-- FROM
-- (
SELECT
c_year,
-- c_month, -- just provides a number - how do I convert an ::int
(say 1) to 'January'
CASE
WHEN c_month = 1 THEN 'January'
WHEN c_month = 2 THEN 'February'
-- then March... &c...
END AS the_month,
-- SELECT to_char(to_timestamp (c_month::text, 'MM'), 'TMmon') --
tried variants of this, want better than CASE
MAX(month_count) AS il_mc
FROM
(
SELECT nature_of_illness as illness,
EXTRACT(YEAR FROM created_at) AS c_year,
EXTRACT(MONTH FROM created_at) AS c_month,
COUNT(EXTRACT(MONTH FROM created_at)) AS month_count
FROM illness
GROUP BY illness, EXTRACT(YEAR FROM created_at), EXTRACT(MONTH FROM
created_at)
ORDER BY EXTRACT(MONTH FROM created_at)
) t1
GROUP BY c_year, c_month
ORDER BY c_year, c_month
which gives (sample data DML below)
c_year the_month il_mc
2017 January 1
2017 February 3
2018 January 5
2018 February 3
These figures are correct with the sample data.
What I'm finding difficult is to include the most common illness name
text for a given month! And what do I do in the case of ties (as for
January 2017 in sample data)? I imagine that my ties scenario will
require Analytic functions? DENSE_RANK or similar? I'd be grateful for
a non-Analytic solution and an Analytic one. Any discussion,
references, URLs or other helpful data most appreciated - I really
want to grasp what's going on here! I think I need to JOIN on my
derived table(s) but can't figure it!
If anything extra is required, please let me know!
Rgs,
Pól
DDL and DML -------------------------
CREATE table illness (nature_of_illness VARCHAR(25), created_at TIMESTAMP);
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung', '2018-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-01-03 15:45:40'); -- one
of each for Jan 2017
INSERT INTO illness VALUES ('Lung', '2017-01-03 17:50:32'); -- one
of each for Jan 2017 - dealing with ties?
INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-02-03 17:50:32');
I have a table called illness - it's oncology so these are all tumours.
Let's say we have cervical and lung and some months we might have more
lung tumours than cervical and other months it's vice versa!
We record them as follows (full DDL and DML at end of post)
nature_of_illness VARCHAR(25), created_at TIMESTAMP (UTC) - these
dates and times are essentially random - tests can be returned by
international labs at any time of the day or night.
Now, what I want is to find which illness was reported most in a given
month and in a given year.
The records returned should look like
nature_of_illness year month count (count has to be the MAX
out of all illnesses for a given month)
lung 2017 January 53 - i.e. in January
2017, the most reported illness was lung tumours of which there were
53
cervix 2017 February 45
..
&c...
I have an added constraint - this must work using "old" SQL - i.e.
with MySQL 5.6 as well as PostgreSQL 10 - no Windows/Analytics
functions &c.
I have got this far:
-- SELECT DISTINCT(i_data), il_mc) -- commented out pieces show other
failed attempts!
-- FROM
-- (
SELECT
c_year,
-- c_month, -- just provides a number - how do I convert an ::int
(say 1) to 'January'
CASE
WHEN c_month = 1 THEN 'January'
WHEN c_month = 2 THEN 'February'
-- then March... &c...
END AS the_month,
-- SELECT to_char(to_timestamp (c_month::text, 'MM'), 'TMmon') --
tried variants of this, want better than CASE
MAX(month_count) AS il_mc
FROM
(
SELECT nature_of_illness as illness,
EXTRACT(YEAR FROM created_at) AS c_year,
EXTRACT(MONTH FROM created_at) AS c_month,
COUNT(EXTRACT(MONTH FROM created_at)) AS month_count
FROM illness
GROUP BY illness, EXTRACT(YEAR FROM created_at), EXTRACT(MONTH FROM
created_at)
ORDER BY EXTRACT(MONTH FROM created_at)
) t1
GROUP BY c_year, c_month
ORDER BY c_year, c_month
which gives (sample data DML below)
c_year the_month il_mc
2017 January 1
2017 February 3
2018 January 5
2018 February 3
These figures are correct with the sample data.
What I'm finding difficult is to include the most common illness name
text for a given month! And what do I do in the case of ties (as for
January 2017 in sample data)? I imagine that my ties scenario will
require Analytic functions? DENSE_RANK or similar? I'd be grateful for
a non-Analytic solution and an Analytic one. Any discussion,
references, URLs or other helpful data most appreciated - I really
want to grasp what's going on here! I think I need to JOIN on my
derived table(s) but can't figure it!
If anything extra is required, please let me know!
Rgs,
Pól
DDL and DML -------------------------
CREATE table illness (nature_of_illness VARCHAR(25), created_at TIMESTAMP);
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Cervix', '2018-01-03 15:45:40');
INSERT INTO illness VALUES ('Lung', '2018-01-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2018-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-01-03 15:45:40'); -- one
of each for Jan 2017
INSERT INTO illness VALUES ('Lung', '2017-01-03 17:50:32'); -- one
of each for Jan 2017 - dealing with ties?
INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Lung', '2017-02-03 17:50:32');
INSERT INTO illness VALUES ('Cervix', '2017-02-03 17:50:32');