如何检查序列中的值是否包含在定义为DataFrame的行的任何间隔中? [英] How can I check if the values in a series are contained in any of the intervals defined the rows of a DataFrame?
问题描述
我意识到我的头衔有些混乱,但是如果我们以身作则,我想可以更清楚一些.我要做的是向量化测试,以检查给定系列中的任何值是否包含在具有start
和stop
列的DataFrame对象定义的任何间隔中.
I realize my title is a bit confusing, but I think I can make it clearer if we proceed by example. What I want to do is a vectorized test to check if any of the values in a given series is contained in any of the intervals defined by a DataFrame object with a start
and stop
column.
请考虑系列valid
,这是一个名为trials
的DataFrame的列.这是trials
的样子:
Consider the series, valid
, which is the column of a DataFrame called trials
. Here is what trials
Looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156 entries, 0 to 155
Data columns (total 3 columns):
start 156 non-null values
stop 156 non-null values
valid 156 non-null values
dtypes: bool(1), float64(2)
我有一个单独的DataFrame,称为"blink".它包含三列:
I have a separate DataFrame called 'blink`. It has three columns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41 entries, 0 to 40
Data columns (total 3 columns):
tstart 41 non-null values
tstop 41 non-null values
dur 41 non-null values
dtypes: bool(1), float64(2)
最后一列不是直接相关的:这是眼神的持续时间,即tstop
和tstart
之间的差异.
The last column is not directly relevant: it's the duration of the eyeblik, i.e. the difference betwee tstop
and tstart
.
如果trials['start']
到trials['stop']
的间隔与blink['tstart']
到blink['tstop']
的 any 重叠,我想将trials['valid']
的每一行设置为False
间隔.
I would like to set each row of trials['valid']
to False
if the interval between it's corresponding trials['start']
to trials['stop']
overlaps with any of the blink['tstart']
to blink['tstop']
intervals.
我可以遍历所有行,并使用np.arange
和in
运算符在嵌套循环中完成此操作,但实际上需要几个小时(我的实际数据集比这个虚拟的例子).我可以使用矢量化方法吗?如果没有,有没有一种更快的基于迭代的方法?
I could iterate through the rows and use np.arange
along with the in
operator to do this in a nested loop, but it literally takes hours (my actual data set is much larger than this dummy example). Is there a vectorized approach I could use? If not, is there a faster iteration-based approach?
如果不清楚,我当然会很乐意提供其他详细信息.
If anything is unclear, I'll of course be happy to provide additional details.
推荐答案
您的眨眼数据
In [27]: blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))
In [28]: blink_s = blink.stack()
In [29]: blink_s.index = [ "%s_%s" % (v,i) for i, v in blink_s.index ]
构造一系列的眨眼(有点像旋转),但是我们需要新的名字
Construct a series of of the blink (kind of like pivoting), but we need new names
In [37]: blink_s
Out[37]:
tstart_0 0
tstop_0 5
tstart_1 10
tstop_1 15
试验数据
In [30]: trial = pd.DataFrame(dict(start = [3,7,12],stop=[4,10,16]))
在试验的各行之间平铺blink_s
Tile the blink_s across rows of the trial
In [32]: blink_df = pd.DataFrame([ blink_s for i in trial.index ])
加入他们
In [33]: x = trial.join(blink_df)
In [34]: x
Out[34]:
start stop tstart_0 tstop_0 tstart_1 tstop_1
0 3 4 0 5 10 15
1 7 10 0 5 10 15
2 12 16 0 5 10 15
那么您的答案就是一个矢量化的布尔表达式(这可能是一个很长的布尔表达式,因此您应该以编程方式生成它,但是这样做并不那么复杂)
Your answer is then a vectorized boolean expression (this maybe be a long one, so you should programatically generate it, but its not that complicated to do that)
In [35]: x['valid'] = ((x.start>x.tstart_0) & (x.stop<=x.tstop_0)) | ((x.start>x.tstart_1)&(x.stop<=x.tstop_1))
In [36]: x
Out[36]:
start stop tstart_0 tstop_0 tstart_1 tstop_1 valid
0 3 4 0 5 10 15 True
1 7 10 0 5 10 15 False
2 12 16 0 5 10 15 False
如果您希望将浮动数据用作tstart/tstop标准,则此方法将起作用.如果将时间间隔限制为仅int数据,则解决方案会更简单一些,因为不必执行所有这些操作,而是可以创建一个包含的单个值系列(例如blink_s),然后执行isin
.
This will work if you want to have float data as your tstart/tstop criteria. If you restrict the intervals to only int data, then the solution is a bit simplier, as instead of doing all this, you can just create a single series of the values that are included (like blink_s), and just do isin
.
从本质上讲,您是将眨眼框展平为一系列,然后可以将其应用于每个审判
In essence you are flattening the blink frame to a series that you then can apply to each of the trial
使用Isin(和OP数据):
Using Isin (and OP data):
转换为int64数据
trial = pd.load('trials.pickle').reindex(columns=['start','stop']).astype('int64')
blink = pd.load('blink.pickle').astype('int64')
添加我们知道的范围内的行
Add in a row that we know is ni the range
trial = trial.append(pd.Series(dict(start=1000,stop=1200)),ignore_index=True)
构造我们要测试的值的范围
Construct the range of values which we want to test
selections = []
for r in blink.iterrows():
e = r[1]
selections.extend(list(np.arange(e['tstart'],e['tstop'])))
selections = pd.Series(selections)
如果传递的开始/停止在选择范围内,则返回true
Return true if the passed start/stop are in the selection range
def f(s):
return s.isin(selections).all()
trial['valid'] = trial.apply(f,axis=1)
trial[trial.valid]
我插入了我知道会通过的1行,没有其他行通过
I inserted 1 row that I knew would pass, no other rows pass
start stop valid
156 1000 1200 True
这篇关于如何检查序列中的值是否包含在定义为DataFrame的行的任何间隔中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!