Efficiently check if value is present in any of given ranges


Problem description

I have two pandas DataFrame objects:

  • A contains 'start' and 'finish' columns

  • B has column 'date'

The goal is to efficiently create a boolean mask indicating whether each date in B falls within any of the [start, finish] intervals in A.

Naive iteration takes too much time; I guess there is a faster way to do this.

UPDATE: A and B have different number of rows

UPDATE2: Sample:

A
    | start     | finish    |
    |-------    |--------   |
    | 1         | 3         |
    | 50        | 83        |
    | 30        | 42        |

B
    | date      | 
    |-------    |
    | 31        | 
    | 20        | 
    | 2.5       |
    | 84        |
    | 1000      |

Output:
            | in_interval | 
            |-------    |
            | True      | 
            | False     | 
            | True      |
            | False     |
            | False     |

P.S. My data is in datetime format, but I guess the solution will not differ from the one for numbers.

Solution

You can do it with close to linear complexity: O(n) once the interval endpoints are sorted, plus the cost of the sort itself. The idea is to transform the representation. In A, you store one row per interval. I would suggest a dataframe which stores one row per transition (i.e. entering an interval or leaving an interval).

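import pandas as pd

A = pd.DataFrame(
    data={
        'start': [1, 50, 30],
        'finish': [3, 83, 42]
    }
)

# +1 at every start and -1 at every finish, indexed by the endpoint value
starts = pd.DataFrame(data={'start': 1}, index=A.start.tolist())
finishs = pd.DataFrame(data={'finish': -1}, index=A.finish.tolist())
# the outer merge on the index yields one row per endpoint, sorted by value
transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)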
transitions

    start  finish
1       1       0
3       0      -1
30      1       0
42      0      -1
50      1       0
83      0      -1

This dataframe stores, for each date, the type of transition. Now we need to know at each date whether we are in an interval or not. It works like counting opening and closing parentheses. You can do:

transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
transitions

    transition
1            1
3            0
30           1
42           0
50           1
83           0

Here it says:

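  • At 1, I'm in an interval
  • At 3, I'm not (the counter drops back to 0 at the finish value itself, so a date exactly equal to a finish is reported as outside the interval)
  • In general, if the value is strictly greater than 0, it's in an interval.
  • Note that this handles overlapping intervals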
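To illustrate the last point, here is a minimal sketch with two hypothetical overlapping intervals, [1, 10] and [5, 20], which are not part of the sample data. It builds the counter with `pd.concat` and `sort_index` rather than the merge used above, so treat it as an illustration of the counting idea and not as the exact code of this answer.

```python
import pandas as pd

# Hypothetical overlapping intervals: [1, 10] and [5, 20].
starts = pd.Series(1, index=[1, 5])        # +1 when an interval opens
finishes = pd.Series(-1, index=[10, 20])   # -1 when an interval closes

# One row per transition, ordered by position. The running sum is the
# number of intervals that are open right after each transition.
depth = pd.concat([starts, finishes]).sort_index().cumsum()
print(depth)
```

The counter rises to 2 while both intervals are open and only returns to 0 after the last one closes, so the "strictly greater than 0" test still works.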

And now you merge with your B dataframe:

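# B is indexed by the dates here (use B.set_index('date') if they are in a column)
B = pd.DataFrame(
    index=[31, 20, 2.5, 84, 1000]
)

# forward-fill the counter onto the dates; dates before the first start get 0
pd.merge(transitions, B, how='outer', left_index=True, right_index=True).ffill().fillna(0).loc[B.index].astype(bool)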

       transition
31.0         True
20.0        False
2.5          True
84.0        False
1000.0      False
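For reference, the same counting idea can be sketched without building the transitions dataframe. This is a swapped-in variant based on `np.searchsorted`, not the merge approach above. The function name `in_any_interval` is hypothetical, and the sketch assumes every interval satisfies start <= finish. It treats both endpoints as inside the interval, as in the question.

```python
import numpy as np
import pandas as pd


def in_any_interval(starts, finishes, dates):
    """Return a boolean array: True where a date lies in at least one
    closed interval [start, finish]. Assumes start <= finish per interval."""
    starts = np.sort(np.asarray(starts))
    finishes = np.sort(np.asarray(finishes))
    dates = np.asarray(dates)
    # Number of intervals that started at or before each date.
    opened = np.searchsorted(starts, dates, side='right')
    # Number of intervals that finished strictly before each date.
    closed = np.searchsorted(finishes, dates, side='left')
    # A date is covered when more intervals have opened than have closed.
    return opened > closed


A = pd.DataFrame({'start': [1, 50, 30], 'finish': [3, 83, 42]})
B = pd.DataFrame({'date': [31, 20, 2.5, 84, 1000]})
B['in_interval'] = in_any_interval(A['start'], A['finish'], B['date'])
print(B)
```

Sorting dominates the cost, and each lookup is a binary search. Because it relies only on ordering comparisons, the same function should also work on timezone-naive datetime columns, though that is not exercised here.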
