30,000 data points, find greatest change over 2 weeks' time

Problem description

I have:

- 30,000 data points
- each data point is a measurement of type float
- each measurement is associated with a date
- each date has only one measurement
- no dates are without measurements
- the data comes in the form of a text file: 30,000 lines in this form:
    - YYYY-MM-DD I,F (e.g. 1977-02-08 20.74)
- measurements appearing in the source file are already sorted by date
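
A minimal sketch of reading that file format, assuming each line is a date and a decimal value separated by whitespace, as in the example line above. The name `parse_measurements` is hypothetical and only serves this illustration.

```python
from datetime import date, datetime

def parse_measurements(lines):
    """Parse 'YYYY-MM-DD value' lines into parallel lists of dates and floats."""
    dates, values = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        day_text, value_text = line.split()
        dates.append(datetime.strptime(day_text, "%Y-%m-%d").date())
        values.append(float(value_text))
    return dates, values

# Two made-up lines in the format described above.
sample = ["1977-02-08 20.74", "1977-02-09 21.10"]
dates, values = parse_measurements(sample)
```

Keeping dates and values in parallel lists means an algorithm can work on plain indices and the dates can be looked up afterwards.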

I need:

- a time-interval T with boundaries (s,e) /* start, end */
- (e - s = 14 days) the time-interval *must* be 2 weeks
- define min as the lowest value in the interval T
- define max as the greatest value in the interval T
- the chosen T needs to have the greatest distance between max and min of all possible Ts
- break ties among intervals T by choosing the most recent (with the greatest s value)
- the chosen T must consider all jumps in the 14 days, not just the values @ s and e
- if the overall "variance" in the interval is great but the jump 
  |max-min| is not the greatest in absolute value, T is not the right choice,
  even if it's an "exciting" interval

I ask:

- which algorithm to employ, considering algorithms are not my specialty
- which data structure to use to keep track of the subtotals

Please note:

- an answer in pseudo code would be preferred, "prose" is fine if pressured for time
- an answer in Python would be... splendid :)

If you want, you can generate "dummy" data and run your proposed algorithm as a test or I could share the actual data.

Performance is not my main concern here. I mostly want to know the fastest way to do this so that I learn how to apply the right solution and the correct algorithm.

I think I can "prove" correctness with even the simplest iterative algorithm because the dataset is small given today's computers.
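
Such a simple iterative baseline might look like the sketch below. It assumes a window of 14 consecutive measurements (the question's "14 days" could also be read as 15 measurements, so `window` is a parameter), and `brute_force_interval` is a hypothetical name. It recomputes max and min for every window, which is cheap at this data size and gives a reference result for checking a faster method.

```python
def brute_force_interval(values, window=14):
    """Return (s, e) indices of the window with the largest max - min.

    Ties go to the most recent window, as the question requires.
    Returns None if there are fewer values than the window size.
    """
    best_diff, best = None, None
    for s in range(len(values) - window + 1):
        chunk = values[s:s + window]
        diff = max(chunk) - min(chunk)
        # >= lets a later window replace an earlier one on a tie.
        if best_diff is None or diff >= best_diff:
            best_diff, best = diff, (s, s + window - 1)
    return best

print(brute_force_interval([1, 5, 2, 9, 3], window=2))  # (2, 3)
```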

So far, I am "traversing and carrying along 14 vectors of 14 measurements". If you could teach me how to do this incrementally with sub-sums, that would be really appreciated.

Recommended answer

Sliding windows do actually work here, by keeping two stacks (perhaps this is a little misleading, as this is probably best implemented as a double-ended queue). Keep a stack called minstack and a stack called maxstack. The crux of the algorithm is that, read from bottom to top, minstack should be non-decreasing and maxstack should be non-increasing at all points of the slide. So, how do we do that?

First, add the first 14 points to both stacks. Let's define add(point) as:

Do this for the minstack:

  • While the point is smaller than the top element of minstack, remove the top element of minstack.
  • Add the point to minstack.

Similarly, for the maxstack:

  • While the new point is larger than the top element of maxstack, remove the top element of maxstack.
  • Add the point to maxstack.
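
To see what the two rules do, here is a small trace on made-up input. It is only a sketch: the deques stand in for the stacks, with the right end as the "top" and the left end as the "bottom".

```python
from collections import deque

def add(point, minstack, maxstack):
    # Rule for minstack: drop tops that are larger than the new point.
    while minstack and point < minstack[-1]:
        minstack.pop()
    minstack.append(point)
    # Rule for maxstack: drop tops that are smaller than the new point.
    while maxstack and point > maxstack[-1]:
        maxstack.pop()
    maxstack.append(point)

minstack, maxstack = deque(), deque()
for point in [3, 1, 4, 1, 5]:
    add(point, minstack, maxstack)
    print(point, list(minstack), list(maxstack))
# Final state: minstack is [1, 1, 5] and maxstack is [5], so the bottoms
# (index 0) hold the minimum 1 and the maximum 5 of everything added.
```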

Due to the property above, the min and max of the first 14 elements are the bottom elements of minstack and maxstack. Now slide the window. We simply have to note that if the left point is still "alive" in either of the stacks, it is necessarily the bottom point. Therefore this is easy; it's simply:

slide():
    add(new_point)
    if (left_point == bottom(minstack)) remove_bottom(minstack)
    if (left_point == bottom(maxstack)) remove_bottom(maxstack)

Do this until your points are exhausted. The interval you are looking for is the one in which bottom(maxstack) - bottom(minstack) was largest.

Note that every point enters each of minstack and maxstack at most once and leaves each at most once, so this does at most 4 stack operations per point, no matter what the size of the desired interval is.
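
One way to check that bound empirically is to count every push and removal while sliding over random data. This is an illustrative sketch; `count_operations` and its `window` parameter are hypothetical names, and the data is randomly generated rather than the asker's file.

```python
import random
from collections import deque

def count_operations(points, window=14):
    """Count every push and removal on the two deques while sliding the window."""
    minstack, maxstack = deque(), deque()
    ops = 0
    for index, x in enumerate(points):
        while minstack and x < minstack[-1]:
            minstack.pop()
            ops += 1
        while maxstack and x > maxstack[-1]:
            maxstack.pop()
            ops += 1
        minstack.append(x)
        maxstack.append(x)
        ops += 2
        if index >= window:
            leaving = points[index - window]  # the point that just left the window
            if minstack[0] == leaving:
                minstack.popleft()
                ops += 1
            if maxstack[0] == leaving:
                maxstack.popleft()
                ops += 1
    return ops

random.seed(0)
points = [random.random() for _ in range(30000)]
total = count_operations(points)
print(total <= 4 * len(points))  # True
```

Each point is pushed exactly twice (once per deque) and removed at most twice, which is where the factor of 4 comes from.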

Edit: I just noticed you wanted an implementation in Python. I didn't really want to parse the data, so the function takes a list of values as input, and outputs the indices (s,e) in that array:

import collections

def add(x, minstack, maxstack):
    # Pop from the top everything the new point makes irrelevant, so that
    # minstack stays non-decreasing and maxstack non-increasing (bottom to top).
    while minstack and x < minstack[-1]: minstack.pop()
    while maxstack and x > maxstack[-1]: maxstack.pop()
    minstack.append(x)
    maxstack.append(x)

def get_largest_interval(points):
    minstack = collections.deque()
    maxstack = collections.deque()

    best_diff = -1
    best_interval = None

    for index, elem in enumerate(points):
        add(elem, minstack, maxstack)
        if index >= 14:
            # The point that just left the window can only survive at the bottom.
            if minstack[0] == points[index-14]: minstack.popleft()
            if maxstack[0] == points[index-14]: maxstack.popleft()

        if index >= 13:
            # A full window of 14 points ends at this index.
            this_diff = maxstack[0] - minstack[0]
            # >= so that ties go to the most recent interval.
            if best_diff == -1 or this_diff >= best_diff:
                best_interval = (index-13, index)
                best_diff = this_diff

    return best_interval


print(get_largest_interval([0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3]))  # (0, 13)
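
As a variation on the code above (not the answer author's version), the deques can store indices instead of values. Expiry is then checked by position rather than by comparing values, and the window length becomes a parameter, which helps if "14 days" should mean 15 measurements. The name `largest_interval_by_index` is hypothetical.

```python
from collections import deque

def largest_interval_by_index(values, window=14):
    """Return (s, e) of the window with the largest max - min, latest wins ties."""
    min_idx, max_idx = deque(), deque()  # indices; fronts hold window min / max
    best_diff, best = None, None
    for i, x in enumerate(values):
        while min_idx and x < values[min_idx[-1]]:
            min_idx.pop()
        while max_idx and x > values[max_idx[-1]]:
            max_idx.pop()
        min_idx.append(i)
        max_idx.append(i)
        # At most one index leaves the window per step, so one check suffices.
        if min_idx[0] <= i - window:
            min_idx.popleft()
        if max_idx[0] <= i - window:
            max_idx.popleft()
        if i >= window - 1:
            diff = values[max_idx[0]] - values[min_idx[0]]
            if best_diff is None or diff >= best_diff:
                best_diff, best = diff, (i - window + 1, i)
    return best

print(largest_interval_by_index([0] + [2] * 13 + [3]))  # (0, 13)
```

With the dates kept in a list parallel to the values, the returned indices can be used directly to look up the start and end dates of the chosen interval.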
