标记不连续的日期范围 [英] Mark non-contiguous date ranges

查看:138
本文介绍了标记不连续的日期范围的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

全局历史气候学网络已在其天气测量结果中标记了无效或错误的数据。删除这些元素后,有大量数据不再具有连续的日期部分。数据类似于:

The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguously dated sections. The data resembles:

"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14  -- Start of March
"2008-03-02";17
"2008-03-05";17



问题(输出)



尽管可以推断缺失的数据(例如,通过对其他年份的平均值)以提供连续范围,以简化系统,我想根据是否有连续的日期范围来标记不连续的细分来填充月份:

Problem (Output)

Although possible to extrapolate missing data (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system, I want to flag the non-contiguous segments based on whether there is a contiguous range of dates to fill the month:

D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14  -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17

在1843年进行了一些测量。

Some measurements were taken in the year 1843.

对于所有气象站,您如何标记月份中缺少一天或几天的所有日子?

For all weather stations, how would you mark all the days in months that are missing one or more days?

用于选择数据的代码类似于:

The code to select the data resembles:

select
  m.id,
  m.taken,
  m.station_id,
  m.amount
from
  climate.measurement



相关构想



生成一个包含连续日期的表格并将其与测得的数据进行比较日期。

Related Ideas

Generate a table filled with contiguous dates and compare them to the measured data dates.

  • What is the most straightforward way to pad empty dates in sql results (on either mysql or perl end)?
  • How do I group on continuous ranges
  • http://msdn.microsoft.com/en-us/library/aa175780%28v=sql.80%29.aspx

可以使用重新创建问题

表创建如下:

CREATE TABLE climate.calendar
(
  id serial NOT NULL,
  n character varying(2) NOT NULL,
  d date NOT NULL,
  "valid" boolean NOT NULL DEFAULT true,
  CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);



生成数据



以下SQL插入数据到表中( id [int], n ame [varchar], d ate [date], valid [boolean]):

Generate Data

The following SQL inserts data into a table (id [int], name [varchar], date [date], valid [boolean]):

insert into climate.calendar (n, d) 
    select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
insert into climate.calendar (n, d) 
    select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n

'A''F'的值代表使

删除某些行,如下所示:

Delete some rows as follows:

delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);



尝试#1



以下内容在一个月中缺少某天或几天的所有天数内,请勿将 valid 标志切换为 false

UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
    SELECT DISTINCT date_trunc('month', d)
    FROM climate.calendar A
    WHERE NOT EXISTS (
        SELECT 1
        FROM climate.calendar B
        WHERE A.d - 1 = B.d
   )
);



尝试#2



以下SQL产生一个空结果集:

Attempt #2

The following SQL produces an empty result set:

with gen_calendar as (
    select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;



尝试#3



以下SQL生成电台名称和日期的所有可能组合:

Attempt #3

The following SQL generates all the possible combinations of station names and dates:

select
  distinct( cc.n ), t.d
from
  climate.calendar cc,
  (
    select (date('1982-01-1') + (n || ' days')::interval)::date d
    from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
  ) t
order by
  cc.n

但是,在实际数据中,有几百个站点,日期可追溯到1800年代中期,因此所有测站的所有日期的笛卡尔坐标太大。如果有足够的时间,这种方法可能行得通...必须有一种更快的方法。

However, in the real data there are several hundred stations and the dates go back to the mid 1800s, so the Cartesian of all dates for all stations is too large. Such an approach might work, given enough time... There must be a faster way.

PostgreSQL具有窗口功能。

PostgreSQL has windowing functions.

如何使用postgres中的窗口功能选择特定更改

谢谢!

推荐答案

generate_series()



PostgreSQL的 generate_series()函数可以创建包含连续日期列表的视图:

generate_series()

PostgreSQL's generate_series() function can create a view that contains a consecutive list of dates:

with calendar as (
    select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;

表达式从测试中选择max(date)-min(date)可能相差一个。

一种识别方法无效月份是创建两个视图。第一个计算每个站点每个月应产生的每日读数数量。 (请注意,将 climate.calendar 转换为 climate_calendar 。)秒数返回每个电台每月产生的实际每日读数。 。

One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar is translated to climate_calendar.) The second returns the actual daily readings each station produced per month.

此视图将返回一个月中的实际天数,每站。 (例如,二月总是28天或29天。)

This view will return the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)

create view count_max_station_calendar_days as 
with calendar as (
    select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar
group by n, yr, mo
order by n, yr, mo



每个站点每月的实际天数



返回的总天数少于计数。 (例如,一月将总是少于31天。)

Actual Days Per Month Per Station

The total number of days returned will be fewer than the tallies. (For example, January will always have 31 days or fewer.)

create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;

在生产中删除 ORDER BY 子句(

加入这两个视图以标识站点和月份

Join the two views to identify the stations and months that need to be flagged, into a new view:

create view invalid_station_months as 
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
       on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days)

n   yr    mo  num_days_missing
--
A   1982  1   1
E   2007  3   1

num_days_missing 不是必需的,但很有用

The column num_days_missing is not necessary, but it is useful.

这些是需要更新的行:

select cc.* 
from climate_calendar cc
inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
where valid = true



更新数据库



要更新它们, id 键很方便。

update climate_calendar
set valid = false
where id in (
    select id
    from climate_calendar cc
    inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
    where valid = true
);

这篇关于标记不连续的日期范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆