Efficiently check if value is present in any of given ranges
I have two pandas DataFrame objects:
- A contains 'start' and 'finish' columns
- B has a 'date' column
The goal is to efficiently create a boolean mask indicating whether each date falls in any [start, finish] interval.
Naive iteration takes too much time; I guess there is a faster way to do this.
UPDATE: A and B have different numbers of rows.
UPDATE2: Sample:
A
| start | finish |
|------- |-------- |
| 1 | 3 |
| 50 | 83 |
| 30 | 42 |
B
| date |
|------- |
| 31 |
| 20 |
| 2.5 |
| 84 |
| 1000 |
Output:
| in_interval |
|------- |
| True |
| False |
| True |
| False |
| False |
P.S. My data is in datetime format, but I guess the solution will not differ from the one for numbers.
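You can do it without comparing every date against every interval: the work is linear once everything is sorted, so roughly O(n log n) overall because the merges sort by index. The idea is to transform the representation. In A, you store one row per interval. I would suggest a dataframe which stores one row per transition (i.e. entering an interval or leaving an interval).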
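import pandas as pd

# One row per interval
A = pd.DataFrame(
    data={
        'start': [1, 50, 30],
        'finish': [3, 83, 42]
    }
)
# +1 where an interval opens, -1 where one closes, indexed by position
starts = pd.DataFrame(data={'start': 1}, index=A.start.tolist())
finishs = pd.DataFrame(data={'finish': -1}, index=A.finish.tolist())
# Outer join on the index gives one row per transition point, sorted by position
transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)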
transitions
start finish
1 1 0
3 0 -1
30 1 0
42 0 -1
50 1 0
83 0 -1
This dataframe stores, for each date, the type of transition. Now we need to know at each date whether we are in an interval or not. It is like counting opening and closing parentheses. You can do:
transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
transitions
transition
1 1
3 0
30 1
42 0
50 1
83 0
Here it says:
- At 1, I'm in an interval
- At 3, I'm not (so a date exactly equal to a finish value is treated as outside the interval)
- In general, if the value is strictly greater than 0, it's in an interval.
- Note that this handles overlapping intervals
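To see why the running total copes with overlaps, here is a minimal sketch with two made-up overlapping intervals, [1, 10] and [5, 20]. They are not part of the question's data. It uses `pd.concat` plus `groupby(level=0).sum()` instead of the merge above, an assumption made here so that repeated endpoints would also be summed correctly.

```python
import pandas as pd

# Hypothetical overlapping intervals: [1, 10] and [5, 20]
starts = pd.Series(1, index=[1, 5])       # +1 at each start
finishes = pd.Series(-1, index=[10, 20])  # -1 at each finish

# Sum the +1/-1 events at each position (groupby sorts the positions),
# then take the running total: the number of intervals currently open
events = pd.concat([starts, finishes]).groupby(level=0).sum()
depth = events.cumsum()

print(depth)
```

The depth rises to 2 between 5 and 10, where both intervals are open, and only drops back to 0 at 20, so "strictly greater than 0" still means "inside at least one interval".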
And now you merge with your B dataframe:
B = pd.DataFrame(
index=[31, 20, 2.5, 84, 1000]
)
# Forward-fill the interval depth onto B's dates; dates before the first start get 0
pd.merge(transitions, B, how='outer', left_index=True, right_index=True).ffill().fillna(0).loc[B.index].astype(bool)
transition
31.0 True
20.0 False
2.5 True
84.0 False
1000.0 False
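If you need the closed interval [start, finish] from the question, so that a date equal to a finish value counts as inside, the same counting idea can be sketched with `numpy.searchsorted` instead of merges. This is an alternative to the approach above, not a copy of it. The helper name `in_any_interval` is made up for illustration, and the sketch assumes start <= finish in every row.

```python
import numpy as np
import pandas as pd


def in_any_interval(dates, starts, finishes):
    """Boolean array: True where a date lies in some closed [start, finish].

    Hypothetical helper; assumes start <= finish for every interval.
    """
    dates = np.asarray(dates)
    sorted_starts = np.sort(np.asarray(starts))
    sorted_finishes = np.sort(np.asarray(finishes))
    # Number of intervals that opened at or before each date
    opened = np.searchsorted(sorted_starts, dates, side="right")
    # Number of intervals that closed strictly before each date
    closed = np.searchsorted(sorted_finishes, dates, side="left")
    # The difference is the number of intervals covering the date
    return opened > closed


A = pd.DataFrame({"start": [1, 50, 30], "finish": [3, 83, 42]})
B = pd.DataFrame({"date": [31, 20, 2.5, 84, 1000]})
B["in_interval"] = in_any_interval(B["date"], A["start"], A["finish"])
print(B)
```

On the sample data this gives True, False, True, False, False, matching the expected output. Since it relies only on ordering, it should carry over to datetime64 columns as well, though that is not tested here.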