Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

Problem description

I have a Pandas DataFrame with an index and values between 0 and 1, something like this:

 6  0.047033
 7  0.047650
 8  0.054067
 9  0.064767
10  0.073183
11  0.077950

I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5). So that I would have something like this:

 [(150, 185), (632, 680), (1500,1870)]

Here the first tuple describes a region that starts at index 150, has 35 consecutive values that are all above 0.5, and ends at index 185 (exclusive).

I started by filtering for only values above 0.5, like so:

 df = df[df['values'] >= 0.5]

Now I have values like this:

632  0.545700
633  0.574983
634  0.572083
635  0.595500
636  0.632033
637  0.657617
638  0.643300
639  0.646283

I can't show my actual dataset, but the following one should be a good representation:

import numpy as np
import pandas as pd

# fix the seed so the example is reproducible
np.random.seed(seed=901212)

df = pd.DataFrame(range(1, 501), columns=['indices'])
df['values'] = np.random.rand(500) * .5 + .35

This yields:

 1  0.491233
 2  0.538596
 3  0.516740
 4  0.381134
 5  0.670157
 6  0.846366
 7  0.495554
 8  0.436044
 9  0.695597
10  0.826591
...

Here the region (2,4) has two values above 0.5, but that is too short to count. On the other hand, the region (25,44), with 19 values above 0.5 in a row, would be added to the list.

Recommended answer

You can find the first and last element of each consecutive region by comparing the series with copies of itself shifted by one row, and then keep only the pairs that are far enough apart:

# tag rows based on the threshold
df['tag'] = df['values'] > .5

# first row is a True preceded by a False
# (fill_value=False keeps the shifted column boolean, so ~ negates it correctly)
fst = df.index[df['tag'] & ~df['tag'].shift(1, fill_value=False)]

# last row is a True followed by a False
lst = df.index[df['tag'] & ~df['tag'].shift(-1, fill_value=False)]

# keep regions with more than 5 values; j is the last label, inclusive
# (this assumes the index consists of consecutive integers)
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]

For example, the first region is:

>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
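
The pairs in pr use an inclusive last label, while the question asked for tuples whose end point is non-inclusive. As a minimal sketch, the same shift-based idea can be wrapped in a helper that returns half-open (start, end) tuples. The function name find_regions and its parameters threshold and min_len are hypothetical and not part of the original answer. The sketch assumes an integer index with step 1, and the sample Series is made up for illustration.

```python
import pandas as pd


def find_regions(values, threshold=0.5, min_len=6):
    """Return half-open (start, end) pairs for runs of values above threshold.

    Hypothetical helper for illustration; assumes a consecutive integer index.
    """
    tag = values > threshold
    # A run starts where a True is preceded by a False (or by nothing).
    starts = values.index[tag & ~tag.shift(1, fill_value=False)]
    # A run ends where a True is followed by a False (or by nothing).
    ends = values.index[tag & ~tag.shift(-1, fill_value=False)]
    # Turn the inclusive last label into an exclusive end, then filter by length.
    return [(int(i), int(j) + 1)
            for i, j in zip(starts, ends)
            if j - i + 1 >= min_len]


# Made-up data: runs above 0.5 at labels 1-2, 4-9 and 11.
s = pd.Series([0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.7, 0.6, 0.55, 0.51, 0.3, 0.9])
regions = find_regions(s)
print(regions)
```

With min_len=6 ("more than 5 consecutive values"), only the run covering labels 4 through 9 is long enough, so it is reported as (4, 10). Lowering min_len lets the shorter runs through as well.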
