Getting local max/min values from variable row ranges defined with a condition?

Question

I have the following question. I'm working on a time series in pandas that has a histogram column whose values are sometimes positive and sometimes negative. I need to fill a new column of the dataframe with the local max/min value for each window. The window length keeps changing, because each window runs from one sign change (positive to negative, or vice versa) to the next. I also need to use a pandas or numpy method for efficiency.

I have been experimenting with an auxiliary column, filled via df.loc with the row position of the last change from positive to negative or vice versa, like this:

df.loc[(df.Histogram.shift(1) > 0) & (df.Histogram < 0), 'LOC'] = df.index.get_loc(df.Histogram)
df.LOC.fillna(method='ffill')

(This resulted in an error.) The idea was to later calculate the differences between these row positions to define the current max/min window and then apply df.Histogram.rolling(loc_differences).max(). I could not make it work, because .rolling only accepts a fixed window value and because I couldn't fill a column with the locations. I know there must be a simple solution for this. This is an example of what I'm looking for:

Date             Histogram     Max/Min Value
01/02/2021         0.2            0.7
02/02/2021         0.3            0.7
03/02/2021         0.7            0.7
04/02/2021         0.2            0.7
05/02/2021        -0.2           -0.5
06/02/2021        -0.5           -0.5
07/02/2021        -0.1           -0.5
08/02/2021         0.4            0.4
09/02/2021         0.3            0.4
10/02/2021        -0.2           -0.2 
11/02/2021         0.2            0.7 
12/02/2021         0.7            0.7
13/02/2021         0.2            0.7
14/02/2021         0.3            0.7
15/02/2021         0.6            0.7
16/02/2021         0.2            0.7
17/02/2021        -0.2           -0.5
18/02/2021        -0.5           -0.5
19/02/2021        -0.1           -0.5
20/02/2021         0.4            0.4
21/02/2021         0.3            0.4
22/02/2021        -0.2           -0.3
23/02/2021        -0.1           -0.3 
24/02/2021        -0.3           -0.3
25/02/2021        -0.1           -0.3 
26/02/2021         0.2            0.3
27/02/2021         0.1            0.3    
28/02/2021         0.3            0.3

Is there a solution for this? Thanks in advance.

Recommended answer

Here's a handy way to split your histogram data into groups of positive/negative values. Each time the grp column increments, the histogram column changes sign, and all rows with the same grp value belong to the same interval between two sign changes.

df['grp'] = (df.Histogram > 0).astype(int).diff().abs().cumsum().fillna(0)

df.head(10)
          Date  Histogram  grp
0   01/02/2021        0.2  0.0
1   02/02/2021        0.3  0.0
2   03/02/2021        0.7  0.0
3   04/02/2021        0.2  0.0
4   05/02/2021       -0.2  1.0
5   06/02/2021       -0.5  1.0
6   07/02/2021       -0.1  1.0
7   08/02/2021        0.4  2.0
8   09/02/2021        0.3  2.0
9   10/02/2021       -0.2  3.0
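
To see why the chained expression produces these labels, it can help to look at each intermediate step. The following is only an illustrative sketch: it rebuilds the first ten Histogram values from the question, and the names is_positive, flips and steps are made up here for display purposes.

```python
import pandas as pd

# First ten Histogram values from the question's sample data.
df = pd.DataFrame(
    {"Histogram": [0.2, 0.3, 0.7, 0.2, -0.2, -0.5, -0.1, 0.4, 0.3, -0.2]}
)

# 1 for strictly positive rows, 0 otherwise.
is_positive = (df.Histogram > 0).astype(int)
# 1.0 where the flag differs from the previous row; NaN on the first row.
flips = is_positive.diff().abs()
# Running count of sign changes, used as the group label.
grp = flips.cumsum().fillna(0)

steps = pd.DataFrame(
    {"Histogram": df.Histogram, "is_positive": is_positive,
     "flips": flips, "grp": grp}
)
print(steps)
```

The grp column printed here should match the one shown in the output above.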

The choice of (df.Histogram > 0) is arbitrary and counts zeroes as negative values. A histogram sequence of 0.2, 0, 0.4, 0.3 would yield groups [0, 1, 2, 2], while a sequence of -0.2, 0, -0.4, -0.3 would yield a single group. You'll have to decide whether that is acceptable for your problem.

The .fillna(0) fills the NaN that .diff() produces for the first row. A replacement value of zero is justified: if the sign changes between the first and second row, grp is 1 on row 2, which correctly puts row 1 into its own group. If the sign does not change, grp is 0 on row 2, which correctly groups it with row 1.
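
The two edge cases discussed above (zeroes and the first row) can be checked on small inputs. This is a minimal sketch; sign_groups is a hypothetical helper that simply wraps the same expression, it is not part of pandas.

```python
import pandas as pd

def sign_groups(values):
    """Label runs of positive vs. non-positive values (hypothetical helper)."""
    s = pd.Series(values, dtype=float)
    grp = (s > 0).astype(int).diff().abs().cumsum().fillna(0)
    return grp.astype(int).tolist()

# A zero between positive values is counted as negative and splits the run.
print(sign_groups([0.2, 0, 0.4, 0.3]))      # [0, 1, 2, 2]
# A zero between negative values does not split the run.
print(sign_groups([-0.2, 0, -0.4, -0.3]))   # [0, 0, 0, 0]
# A sign change right after the first row leaves row 1 in its own group.
print(sign_groups([0.2, -0.1]))             # [0, 1]
```

If zeroes should instead be grouped with the positive values, comparing with >= 0 rather than > 0 would be one way to do that.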

You can now compute the minimum and maximum of each group and merge them back into the dataframe like this:

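# Named aggregation; the dict-renaming form {'hist_min': min, ...} on a single
# column is rejected by pandas 1.0 and later (SpecificationError).
minmax = (df.groupby('grp')['Histogram']
            .agg(hist_min='min', hist_max='max')
            .reset_index())
df = df.merge(minmax, on='grp')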

df.head(10)
         Date  Histogram  grp  hist_min  hist_max
0  01/02/2021        0.2  0.0       0.2       0.7
1  02/02/2021        0.3  0.0       0.2       0.7
2  03/02/2021        0.7  0.0       0.2       0.7
3  04/02/2021        0.2  0.0       0.2       0.7
4  05/02/2021       -0.2  1.0      -0.5      -0.1
5  06/02/2021       -0.5  1.0      -0.5      -0.1
6  07/02/2021       -0.1  1.0      -0.5      -0.1
7  08/02/2021        0.4  2.0       0.3       0.4
8  09/02/2021        0.3  2.0       0.3       0.4
9  10/02/2021       -0.2  3.0      -0.2      -0.2

Finally, you can assemble your desired values using boolean indexing:

df['minmax'] = df.hist_min
df.loc[df.Histogram > 0, 'minmax'] = df.hist_max[df.Histogram > 0]

df.head(10)
         Date  Histogram  grp  hist_min  hist_max  minmax
0  01/02/2021        0.2  0.0       0.2       0.7     0.7
1  02/02/2021        0.3  0.0       0.2       0.7     0.7
2  03/02/2021        0.7  0.0       0.2       0.7     0.7
3  04/02/2021        0.2  0.0       0.2       0.7     0.7
4  05/02/2021       -0.2  1.0      -0.5      -0.1    -0.5
5  06/02/2021       -0.5  1.0      -0.5      -0.1    -0.5
6  07/02/2021       -0.1  1.0      -0.5      -0.1    -0.5
7  08/02/2021        0.4  2.0       0.3       0.4     0.4
8  09/02/2021        0.3  2.0       0.3       0.4     0.4
9  10/02/2021       -0.2  3.0      -0.2      -0.2    -0.2

The entire process is vectorized as far as possible, so performance should be decent.
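
As a possible variation on the steps above (not the answer's own code), the merge and the helper columns can be avoided by using groupby(...).transform together with numpy.where. The sketch below assumes the 28 Histogram values from the question and compares the result against the question's expected Max/Min column; the names positive, by_grp and expected are illustrative.

```python
import numpy as np
import pandas as pd

# The 28 Histogram values from the question's example table.
hist = [0.2, 0.3, 0.7, 0.2, -0.2, -0.5, -0.1, 0.4, 0.3, -0.2,
        0.2, 0.7, 0.2, 0.3, 0.6, 0.2, -0.2, -0.5, -0.1, 0.4, 0.3,
        -0.2, -0.1, -0.3, -0.1, 0.2, 0.1, 0.3]
df = pd.DataFrame({"Histogram": hist})

# Same grouping expression as in the answer.
positive = df.Histogram > 0
grp = positive.astype(int).diff().abs().cumsum().fillna(0)

# transform broadcasts each group's max/min back onto its rows.
by_grp = df.Histogram.groupby(grp)
df["minmax"] = np.where(positive,
                        by_grp.transform("max"),
                        by_grp.transform("min"))

# Expected Max/Min Value column from the question.
expected = ([0.7] * 4 + [-0.5] * 3 + [0.4] * 2 + [-0.2] + [0.7] * 6
            + [-0.5] * 3 + [0.4] * 2 + [-0.3] * 4 + [0.3] * 3)
print(df["minmax"].tolist() == expected)
```

Whether this is faster than the merge-based version would need to be measured on the real data; the main difference is that no hist_min and hist_max columns are left behind in the dataframe.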
