如何检查序列中的值是否包含在定义为DataFrame的行的任何间隔中? [英] How can I check if the values in a series are contained in any of the intervals defined the rows of a DataFrame?

查看:151
本文介绍了如何检查序列中的值是否包含在定义为DataFrame的行的任何间隔中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我意识到我的头衔有些混乱,但是如果我们以身作则,我想可以更清楚一些.我要做的是向量化测试,以检查给定系列中的任何值是否包含在具有startstop列的DataFrame对象定义的任何间隔中.

I realize my title is a bit confusing, but I think I can make it clearer if we proceed by example. What I want to do is a vectorized test to check if any of the values in a given series is contained in any of the intervals defined by a DataFrame object with a start and stop column.

请考虑系列valid,这是一个名为trials的DataFrame的列.这是trials的样子:

Consider the series, valid, which is the column of a DataFrame called trials. Here is what trials Looks like:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 156 entries, 0 to 155
Data columns (total 3 columns):
start    156  non-null values
stop     156  non-null values
valid    156  non-null values
dtypes: bool(1), float64(2)

我有一个单独的DataFrame,称为"blink".它包含三列:

I have a separate DataFrame called 'blink`. It has three columns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41 entries, 0 to 40
Data columns (total 3 columns):
tstart    41  non-null values
tstop     41  non-null values
dur       41  non-null values
dtypes: bool(1), float64(2)

最后一列不是直接相关的:这是眼神的持续时间,即tstoptstart之间的差异.

The last column is not directly relevant: it's the duration of the eyeblik, i.e. the difference betwee tstop and tstart.

如果trials['start']trials['stop']的间隔与blink['tstart']blink['tstop'] any 重叠,我想将trials['valid']的每一行设置为False间隔.

I would like to set each row of trials['valid'] to False if the interval between it's corresponding trials['start'] to trials['stop'] overlaps with any of the blink['tstart'] to blink['tstop'] intervals.

可以遍历所有行,并使用np.arangein运算符在嵌套循环中完成此操作,但实际上需要几个小时(我的实际数据集比这个虚拟的例子).我可以使用矢量化方法吗?如果没有,有没有一种更快的基于迭代的方法?

I could iterate through the rows and use np.arange along with the in operator to do this in a nested loop, but it literally takes hours (my actual data set is much larger than this dummy example). Is there a vectorized approach I could use? If not, is there a faster iteration-based approach?

如果不清楚,我当然会很乐意提供其他详细信息.

If anything is unclear, I'll of course be happy to provide additional details.

推荐答案

您的眨眼数据

In [27]: blink = pd.DataFrame(dict(tstart = [0,10], tstop = [5,15]))

In [28]: blink_s = blink.stack()

In [29]: blink_s.index = [ "%s_%s" % (v,i) for i, v in blink_s.index ]

构造一系列的眨眼(有点像旋转),但是我们需要新的名字

Construct a series of of the blink (kind of like pivoting), but we need new names

In [37]: blink_s
Out[37]: 
tstart_0     0
tstop_0      5
tstart_1    10
tstop_1     15

试验数据

In [30]: trial = pd.DataFrame(dict(start = [3,7,12],stop=[4,10,16]))

在试验的各行之间平铺blink_s

Tile the blink_s across rows of the trial

In [32]: blink_df = pd.DataFrame([ blink_s for i in trial.index ])

加入他们

In [33]: x = trial.join(blink_df)

In [34]: x
Out[34]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1
0      3     4         0        5        10       15
1      7    10         0        5        10       15
2     12    16         0        5        10       15

那么您的答案就是一个矢量化的布尔表达式(这可能是一个很长的布尔表达式,因此您应该以编程方式生成它,但是这样做并不那么复杂)

Your answer is then a vectorized boolean expression (this maybe be a long one, so you should programatically generate it, but its not that complicated to do that)

In [35]: x['valid'] = ((x.start>x.tstart_0) & (x.stop<=x.tstop_0)) | ((x.start>x.tstart_1)&(x.stop<=x.tstop_1))

In [36]: x
Out[36]: 
   start  stop  tstart_0  tstop_0  tstart_1  tstop_1  valid
0      3     4         0        5        10       15   True
1      7    10         0        5        10       15  False
2     12    16         0        5        10       15  False

如果您希望将浮动数据用作tstart/tstop标准,则此方法将起作用.如果将时间间隔限制为仅int数据,则解决方案会更简单一些,因为不必执行所有这些操作,而是可以创建一个包含的单个值系列(例如blink_s),然后执行isin.

This will work if you want to have float data as your tstart/tstop criteria. If you restrict the intervals to only int data, then the solution is a bit simplier, as instead of doing all this, you can just create a single series of the values that are included (like blink_s), and just do isin.

从本质上讲,您是将眨眼框展平为一系列,然后可以将其应用于每个审判

In essence you are flattening the blink frame to a series that you then can apply to each of the trial

使用Isin(和OP数据):

Using Isin (and OP data):

转换为int64数据

trial = pd.load('trials.pickle').reindex(columns=['start','stop']).astype('int64')
blink = pd.load('blink.pickle').astype('int64')

添加我们知道的范围内的行

Add in a row that we know is ni the range

trial = trial.append(pd.Series(dict(start=1000,stop=1200)),ignore_index=True)

构造我们要测试的值的范围

Construct the range of values which we want to test

selections = []
for r in blink.iterrows():
    e = r[1]
    selections.extend(list(np.arange(e['tstart'],e['tstop'])))
selections = pd.Series(selections)

如果传递的开始/停止在选择范围内,则返回true

Return true if the passed start/stop are in the selection range

def f(s):
    return s.isin(selections).all()
trial['valid'] = trial.apply(f,axis=1)

trial[trial.valid]

我插入了我知道会通过的1行,没有其他行通过

I inserted 1 row that I knew would pass, no other rows pass

     start  stop valid
156   1000  1200  True

这篇关于如何检查序列中的值是否包含在定义为DataFrame的行的任何间隔中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆