How to manipulate data in arrays using pandas (and resetting evaluations)


Question

I've revised the question for clarity and removed artifacts and inconsistencies - please reopen it for consideration by the community. One contributor already thinks a solution might be possible using groupby in combination with cummax.

I have a dataframe in which the max between the prior value of col3 and the current value of col2 is evaluated through a cummax expression recently offered by Scott Boston (thanks!) as follows:

df['col3'] = df['col2'].shift(-1).cummax().shift()

The resulting dataframe is shown below. I have also added the desired logic, which compares col2 to a setpoint that is computed as a float value.

Result of applying cummax:

   col0  col1  col2  col3
0     1   5.0  2.50   NaN
1     2   4.9  2.45  2.45
2     3   5.5  2.75  2.75
3     4   3.5  1.75  2.75
4     5   3.1  1.55  2.75
5     6   4.5  2.25  2.75
6     7   5.5  2.75  2.75
7     8   1.2  0.60  2.75
8     9   5.8  2.90  2.90
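
As a rough illustration of how the table above comes about, the chained expression can be split into its three steps. The intermediate names (ahead, running, steps) are made up here for illustration and are not part of the original post. Under this reading, col3 at row i ends up holding the running maximum of col2 from row 1 through row i, with row 0 left as NaN.

```python
import pandas as pd

col2 = pd.Series([2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9])

# Step 1: pull each following row's value up by one position
# (the last row becomes NaN because nothing follows it).
ahead = col2.shift(-1)

# Step 2: running maximum over the shifted values.
running = ahead.cummax()

# Step 3: push the result back down by one position,
# which leaves NaN in the first row.
col3 = running.shift()

# Hypothetical helper frame, only to view the steps side by side.
steps = pd.DataFrame({'col2': col2, 'ahead': ahead,
                      'running': running, 'col3': col3})
print(steps)
```

Because the running maximum never decreases, col3 stays at 2.75 from row 2 onward until a larger value arrives, which is the missing reset described in the rest of the question.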

The goal is to flag True whenever col3 >= setpoint (2.71 in the above example), i.e. every time col3's most recent row reaches or exceeds the setpoint.

The problem: the cummax solution does not reset when the setpoint is reached. I need a solution that resets the cummax calculation every time the setpoint is breached. For example, in the table above, after the first True when col3 exceeds the setpoint (i.e. when the col2 value is 2.75), there is a second time when the same condition should be satisfied. This is shown in the extended data table further down, where I've deleted col3's value in row 4 to illustrate the need to 'reset' the cummax calculation. In the if statement, I am using subscript [-1] to target the last (i.e. most recent) row in the df. Note: col2 = current value of col1 * constant1, where constant1 == 0.5.

Code tried so far (note that col3 is not resetting properly):

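import numpy as np
import pandas as pd

# Fragment from inside a class method, hence the references to self.
if self.constant is not None:
    setpoint = self.constant * (1 - self.temp)  # suppose setpoint == 2.71
df = pd.DataFrame({'col0': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col1': [5, 4.9, 5.5, 3.5, 3.1, 4.5, 5.5, 1.2, 5.8],
                   'col2': [2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9],
                   'col3': [np.nan, 2.45, 2.75, 2.75, 2.75, 2.75, 2.75, 2.75, 2.9]})

# .iloc[-1] selects the last (most recent) row by position;
# a plain [-1] on the default integer index raises a KeyError.
if df['col3'].iloc[-1] >= setpoint:
    self.log('setpoint hit')
    return True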

The cummax solution needs tweaking: col3 is supposed to be evaluated from the values of col2 and col3, and once the setpoint is breached (2.71 for col3), the next col3 value should reset to NaN and start a new cummax. The correct output for col3 should be [NaN, 2.45, 2.75, NaN, 1.55, 2.25, 2.75, NaN, 2.9], and the check should return True again and again whenever the last row of col3 breaches the setpoint value of 2.71.

Desired result of applying cummax with additional tweaking for col3 (possibly with a groupby that references col2?): return True every time the setpoint is breached. Here's one example of the resulting col3:

   col0  col1  col2  col3
0     1   5.0  2.50   NaN
1     2   4.9  2.45  2.45
2     3   5.5  2.75  2.75
3     4   3.5  1.75   NaN
4     5   3.1  1.55  1.55
5     6   4.5  2.25  2.25
6     7   5.5  2.75  2.75
7     8   1.2  0.60   NaN
8     9   5.8  2.90  2.90

I'm open to suggestions on whether NaN is returned on the row where the breach occurs or on the next row, as shown above (the key requirement is for the if statement to resolve to True as soon as the setpoint is breached).

Recommended answer

Try:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
              ,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
              ,'col2':[2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9]
              ,'col3':[np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
              })


threshold = 2.71

grp = df['col2'].ge(threshold).cumsum().shift().bfill()

df['col3'] = df['col2'].groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())

print(df)

Output:

   col0  col1  col2  col3
0     1   5.0  2.50   NaN
1     2   4.9  2.45  2.45
2     3   5.5  2.75  2.75
3     4   3.5  1.75   NaN
4     5   3.1  1.55  1.55
5     6   4.5  2.25  2.25
6     7   5.5  2.75  2.75
7     8   1.2  0.60   NaN
8     9   5.8  2.90  2.90
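
With col3 resetting like this, the check from the question could be applied roughly as follows. This is only a sketch: the names breached and latest_hit are hypothetical, and the col3 values are copied from the output above rather than recomputed.

```python
import numpy as np
import pandas as pd

threshold = 2.71  # the setpoint assumed in the question

# col3 values copied from the output table above.
col3 = pd.Series([np.nan, 2.45, 2.75, np.nan, 1.55, 2.25, 2.75, np.nan, 2.9])

# Row-by-row breach flag. NaN compares as False, so the rows
# where col3 was reset are never counted as a breach.
breached = col3.ge(threshold)
print(breached.tolist())

# The question's check on the most recent row, selected by position.
latest_hit = bool(col3.iloc[-1] >= threshold)
print(latest_hit)
```

Each segment now produces its own True when it reaches the setpoint, instead of col3 staying above the setpoint after the first breach.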

Details:

Create a grouping key by testing col2 for greater than or equal to the threshold, then apply the same logic to each group within the dataframe using groupby with transform.
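
To make the grouping key easier to follow, the one-line expression for grp can be unpacked as in the sketch below. The intermediate names hit and counter are introduced here for illustration only and do not appear in the answer.

```python
import pandas as pd

threshold = 2.71
col2 = pd.Series([2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9])

# True on every row where col2 reaches the threshold.
hit = col2.ge(threshold)

# Running count of breaches; it increments on the breach row itself.
counter = hit.cumsum()

# Shift down by one so the breach row stays in the group it closes,
# then back-fill the NaN that the shift leaves in the first row.
grp = counter.shift().bfill()

print(pd.DataFrame({'col2': col2, 'hit': hit,
                    'counter': counter, 'grp': grp}))

# Applying the original cummax logic within each group.
col3 = col2.groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())
print(col3.tolist())
```

The shift is what places the NaN on the row after a breach. Without it, the breach row would start the next group and the NaN would land on the breach row instead, which is the alternative the question says it is also open to.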
