Postgres where子句超过两列 [英] Postgres where clause over two columns
问题描述
数据库-我正在使用Postgres 9.6.5
我正在分析来自美国机场管理局(RITA)的有关航班起降的数据。
此链接( http://stat-computing.org/dataexpo /2009/the-data.html )列出了表格中的所有列。
表具有以下29列
无名称说明
1年1987-2008
2个月1-12
3 DayofMonth 1-31
4 DayOfWeek 1(星期一)-7(星期日)
5实际的DepTime出发时间(本地,hhmm)
6 CRSDepTime计划的出发时间(本地,hhmm)
7实际ArrTime到达时间(本地,hhmm)
8 CRSArr时间计划到达时间(本地,hhmm)
9 UniqueCarrier unique航空公司代码
10 FlightNum航班号
11 TailNum飞机机尾号
12分钟中的ActualElapsedTime
13 CRSElapsedTime中的分钟
14 AirTime中的分钟
15 ArrDelay到达延迟,以分钟为单位tes
16 DepDelay出发延误,以分钟为单位
17始发地IATA机场代码
18目的地国际航空运输协会机场代码
19英里距离
20出租车以分钟为单位的滑行时间
21出租车以分钟为单位的滑行时间
22取消了航班?
23 CancellationCode取消原因(A =运营商,B =天气,C = NAS,D =安全)
24已转移1 =是,0 =否
25分钟内的CarrierDelay
26分钟内的WeatherDelay
27分钟内NASDelay
28分钟内SecurityDelay
29 LateAircraftDelay in minutes
每年大约有一百万行。
我试图找出延迟超过15分钟时最繁忙的机场。
列DepDelay-具有延迟时间。
起点-是机场的起点代码。
所有数据均已加载到名为 ontime的表中
我正在按以下方式形成查询分阶段进行。
-
选择延迟时间超过15分钟的机场
select origin,year,count(*)as depdelay_count from ontime
其中
depdelay> 15
按年份分组,原始
由depdelay_count desc
排序) / p> -
现在,我希望每年仅抽出前十名机场-我正在按照以下步骤进行操作
从x中选择x.origin,x.year(子查询为(
select origin,year,count(*)作为depdelay_count from ontime
其中
depdelay> 15
group按年份,由$ debdelay_count desc
产生的原始
订单
选择(根据子查询的排名,按depdelay_count desc按年顺序划分)origin,year,rank()作为x的x.rank< ; = 10; -
现在,我拥有depdelay排名前10位的机场-我希望对这些机场的总航班数进行统计。
从准时选择起点,count(),其中起点在
中(选择x.origin于(子查询为(
选择起点,year,count()as depdelay_count from ontime
where
depdelay> 15
group by year,起源的
by depdelay_count desc
)
选择origin,year,rank()作为子查询的排名(按depdelay_count desc按年级划分),其中x.rank< = 2)
按原点分组
按原点排序;
如果我通过在year子句中添加年份来修改第3步查询
----将是(1987年至2008年)的任何值。
选择来源,计数( *)from ontime,其中year =(< YEAR>)起始于
(选择x.origin from为(子查询为(
select origin,year,count(*))为ontime $ b $的depdelay_count b其中
的延迟> 15
按年分组,起源
由depdelay_count desc
排序)
选择起源,年,rank()超过(由depdelay_count desc按年级划分)作为子查询的排名)x其中x.rank< = 2)
按产地分组
按产地排序;
但是我要手动避免从1987年到2008年的所有时间里这样做。
请帮助您优化查询,以便我可以选择所有年份的数据,而不必手动选择每年。
我发现查询中间的CTE令人困惑。您基本上可以使用一个CTE /子查询来做到这一点:
,而oy为(
选择来源,年份,计数( *)作为numflights,
sum((depdelay> 15):: int)作为depdelay_count,
row_number()over(按年顺序划分sum((depdelay> 15):: int) desc)作为按时间顺序从始发地
按始发地分组的seqnum
,年份
)
选择oy。*
从oy
中,其中seqnum< = 10;
请注意使用条件聚合以及将窗口函数与聚合函数一起使用。
Database - I am working on in Postgres 9.6.5
I am analyzing the data from US Airport Authority (RITA) about the flights arrival and departures. This link (http://stat-computing.org/dataexpo/2009/the-data.html) lists all the columns in the table.
The table has following 29 columns
No Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
There are about a million rows for each year.
I am trying to find out a count the most busy airports when delay is more than 15minutes. column DepDelay - has the delay time. origin - is the origin code for the airport.
All the data has been loaded into a table called 'ontime'
I am forming the query as follows in stages.
select airports where delay is more than 15 minutes
select origin,year,count(*) as depdelay_count from ontime where depdelay > 15
group by year,origin order by depdelay_count desc )Now I wish to pull out only the top 10 airports per year - which I am doing as follows
select x.origin,x.year from (with subquery as ( select origin,year,count(*) as depdelay_count from ontime where depdelay > 15 group by year,origin order by depdelay_count desc ) select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 10;
Now that I have the top 10 airports by depdelay - I wish to get a count of the total flights out of these airports.
select origin,count() from ontime where origin in (select x.origin from (with subquery as ( select origin,year,count() as depdelay_count from ontime where depdelay > 15 group by year,origin order by depdelay_count desc ) select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2) group by origin order by origin;
If I modify the Step 3 query by adding the year in the year clause
---- will be any value from (1987 to 2008)
select origin,count(*) from ontime where year = (<YEAR>) origin in
(select x.origin from (with subquery as (
select origin,year,count(*) as depdelay_count from ontime
where
depdelay > 15
group by year,origin
order by depdelay_count desc
)
select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2)
group by origin
order by origin;
But I have to do this manually for all years from 1987 to 2008 which I want to avoid.
Please can you help refine the query so that I can select the data for all the years without having to select each year manually.
I find CTEs int he middle of queries to e confusing. You can basically do this with one CTE/subquery:
with oy as (
select origin, year, count(*) as numflights,
sum( (depdelay > 15)::int ) as depdelay_count,
row_number() over (partition by year order by sum( (depdelay > 15)::int ) desc) as seqnum
from ontime
group by origin, year
)
select oy.*
from oy
where seqnum <= 10;
Note the use of conditional aggregation and using window functions with aggregation functions.
这篇关于Postgres where子句超过两列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!