Optimized querying in PostgreSQL
Problem description
Assume you have a table named tracker with the following records.
issue_id | ingest_date         | verb | status
---------+---------------------+------+-------
10       | 2015-01-24 00:00:00 | 1    | 1
10       | 2015-01-25 00:00:00 | 2    | 2
10       | 2015-01-26 00:00:00 | 2    | 3
10       | 2015-01-27 00:00:00 | 3    | 4
11       | 2015-01-10 00:00:00 | 1    | 3
11       | 2015-01-11 00:00:00 | 2    | 4
I need the following results:
10       | 2015-01-26 00:00:00 | 2    | 3
11       | 2015-01-11 00:00:00 | 2    | 4
I am trying out this query:
    select *
    from etl_change_fact
    where ingest_date = (select max(ingest_date)
                         from etl_change_fact);
However, this gives me only this record:
10       | 2015-01-26 00:00:00 | 2    | 3
But I want all unique records (change_id) with
(a) max(ingest_date), AND
(b) the verb column prioritized as follows: 2 is preferred first, 1 second, and 3 last.
Hence, I need the following results:
10       | 2015-01-26 00:00:00 | 2    | 3
11       | 2015-01-11 00:00:00 | 2    | 4
Please help me to efficiently query it.
P.S.: I am not going to index ingest_date because I am going to set it as the "distribution key" in a distributed computing setup. I am a newbie to data warehousing and querying.
Hence, please help me with an optimized way to query my TB-sized database.
This is a typical "greatest-n-per-group" problem. If you search for this tag here, you'll find plenty of solutions, including ones for MySQL.
For Postgres, the quickest way to do it is using distinct on (which is a Postgres-proprietary extension to the SQL language):
    select distinct on (issue_id) issue_id, ingest_date, verb, status
    from etl_change_fact
    order by issue_id,
             case verb
               when 2 then 1
               when 1 then 2
               else 3
             end,
             ingest_date desc;
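For comparison, the same "first row per group" ordering can be sketched with the standard row_number() window function instead of distinct on. This is only an illustration under stated assumptions: the CTE names tracker_sample and ranked and the column alias rn are made up here, and the inline rows are copied from the question so the statement can be run without the real etl_change_fact table. Whether it performs as well as distinct on on a TB-sized table would need to be measured.

```sql
-- Hypothetical stand-in for etl_change_fact, filled with the rows from the question.
with tracker_sample (issue_id, ingest_date, verb, status) as (
    values
        (10, timestamp '2015-01-24 00:00:00', 1, 1),
        (10, timestamp '2015-01-25 00:00:00', 2, 2),
        (10, timestamp '2015-01-26 00:00:00', 2, 3),
        (10, timestamp '2015-01-27 00:00:00', 3, 4),
        (11, timestamp '2015-01-10 00:00:00', 1, 3),
        (11, timestamp '2015-01-11 00:00:00', 2, 4)
),
ranked as (
    select s.*,
           -- Number the rows of each issue: preferred verb first (2, then 1, then others),
           -- and within the same verb priority the newest ingest_date first.
           row_number() over (
               partition by issue_id
               order by case verb when 2 then 1 when 1 then 2 else 3 end,
                        ingest_date desc
           ) as rn
    from tracker_sample s
)
select issue_id, ingest_date, verb, status
from ranked
where rn = 1          -- keep only the top-ranked row per issue_id
order by issue_id;
-- issue 10 -> 2015-01-26 00:00:00, verb 2, status 3
-- issue 11 -> 2015-01-11 00:00:00, verb 2, status 4
```

To use it against the real table, drop the tracker_sample CTE and select from etl_change_fact inside ranked.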
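You can also enhance your original query with a correlated subquery so that it returns the row with the latest ingest_date for each issue_id (note that this version does not apply the verb priority):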
    select f1.*
    from etl_change_fact f1
    where f1.ingest_date = (select max(f2.ingest_date)
                            from etl_change_fact f2
                            where f1.issue_id = f2.issue_id);
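To see what the max(ingest_date) query above picks, here is a small sketch that runs it against the rows from the question. The CTE name tracker_sample is an assumption for this example and stands in for etl_change_fact.

```sql
-- Hypothetical stand-in for etl_change_fact, filled with the rows from the question.
with tracker_sample (issue_id, ingest_date, verb, status) as (
    values
        (10, timestamp '2015-01-24 00:00:00', 1, 1),
        (10, timestamp '2015-01-25 00:00:00', 2, 2),
        (10, timestamp '2015-01-26 00:00:00', 2, 3),
        (10, timestamp '2015-01-27 00:00:00', 3, 4),
        (11, timestamp '2015-01-10 00:00:00', 1, 3),
        (11, timestamp '2015-01-11 00:00:00', 2, 4)
)
select f1.*
from tracker_sample f1
-- The subquery is evaluated per outer row and finds the newest date for that row's issue.
where f1.ingest_date = (select max(f2.ingest_date)
                        from tracker_sample f2
                        where f2.issue_id = f1.issue_id)
order by f1.issue_id;
-- issue 10 -> 2015-01-27 00:00:00, verb 3, status 4
-- issue 11 -> 2015-01-11 00:00:00, verb 2, status 4
```

On this data the row for issue 10 differs from the one requested in the question (2015-01-26, verb 2), because only the date is considered and the verb preference is ignored.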
Edit
For an outdated and unsupported Postgres version, you can probably get away with using something like this:
    select f1.*
    from etl_change_fact f1
    where f1.ingest_date = (select f2.ingest_date
                            from etl_change_fact f2
                            where f1.issue_id = f2.issue_id
                            order by case f2.verb
                                       when 2 then 1
                                       when 1 then 2
                                       else 3
                                     end,
                                     f2.ingest_date desc
                            limit 1);
SQLFiddle example: http://sqlfiddle.com/#!15/3bb05/1