Python Data Frame: cumulative sum of column until condition is reached and return the index


Problem description

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.

Say I have a simple data frame with two columns:

index  Num_Albums  Num_authors
0      10          4
1      1           5
2      4           4
3      7           1000
4      1           44
5      3           8

Num_Albums_tot = sum(Num_Albums) = 30

I need to compute a cumulative sum of the data in Num_Albums until a certain condition is reached, record the index at which the condition is met, and get the corresponding value from Num_authors.

Example: accumulate Num_Albums until the running sum equals 50% ± 1/15 of 30 (i.e. 15 ± 2):

10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.

Condition reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4

I would like to see if there's a function already implemented in pandas before I start thinking about how to do it with a while/for loop.

[I would like to specify the column from which I want to retrieve the value at the relevant index. This comes in handy when I have e.g. 4 columns and I want to sum the elements in column 1, check whether the condition is met, and if so get the corresponding value in column 2; then do the same with columns 3 and 4.]

Solution

Option 1:

You could compute the cumulative sum using cumsum. Then use np.isclose, with its built-in tolerance parameter atol, to check whether the values in this series lie within the specified threshold of 15 ± 2. This returns a boolean array.

np.flatnonzero then returns the positions at which the condition is True; we select the first one.

Finally, use .iloc to retrieve the value from the column you need at the position computed earlier.

val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]      # for faster access, use .iat 
4

This is what np.isclose returns for the cumulative-sum series once it has been converted to an array:

np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False,  True, False, False, False], dtype=bool)
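A self-contained version of Option 1 might look like the sketch below. Two details here are additions rather than part of the answer: rtol=0 is passed so that the window is exactly 15 ± 2 (np.isclose otherwise tests |a - b| <= atol + rtol * |b| with a default rtol of 1e-5), and the result is guarded for the case where no cumulative sum falls inside the window.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

# Running totals as a plain NumPy array: 10, 11, 15, 22, 23, 26
csum = df["Num_Albums"].cumsum().to_numpy()

# rtol=0 keeps the window at exactly 15 ± 2; with the default rtol the
# test would be |csum - 15| <= atol + rtol * 15.
hits = np.flatnonzero(np.isclose(csum, 15, rtol=0, atol=2))

# hits holds positions, so use the positional accessor .iat;
# fall back to None when nothing lies inside the window.
result = df["Num_authors"].iat[hits[0]] if hits.size else None
```

Because hits contains positions rather than index labels, this works the same way whether or not the frame has a default integer index.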

Option 2:

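Build a pd.Index from the cumsum series and use its get_indexer method, whose nearest method also supports a tolerance parameter. (Older pandas offered the same lookup as get_loc(15, 'nearest', tolerance=2) together with df.get_value; pandas 2.0 removed the method and tolerance arguments of get_loc, and pandas 1.0 removed get_value.)

val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]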
4
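One thing worth knowing about the nearest lookup is that it returns the closest cumulative sum, which is not necessarily the first one inside the window. The sketch below illustrates this with Index.get_indexer. The helper name nearest_pos and the extra targets 12 and 18 are made up for illustration, and the approach assumes the cumulative sums are unique and monotonic (for example, all values strictly positive).

```python
import pandas as pd

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

# Index built from the running totals: 10, 11, 15, 22, 23, 26
csum_index = pd.Index(df["Num_Albums"].cumsum())


def nearest_pos(target, tol):
    """Hypothetical helper: position of the cumulative sum closest to
    target, or None if even the closest one is more than tol away."""
    pos = csum_index.get_indexer([target], method="nearest", tolerance=tol)[0]
    return None if pos == -1 else int(pos)  # get_indexer signals "no match" with -1


print(nearest_pos(15, 2))  # 2: exact match at position 2
print(nearest_pos(12, 3))  # 1: 11 is closest, although 10 (position 0) is also within 12 ± 3
print(nearest_pos(18, 2))  # None: the closest value, 15, is 3 away
```

If the requirement is strictly "stop at the first running total inside the window", Option 1 or Option 3 matches it more directly.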

Option 3:

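Use idxmax to find the index label of the first True value in the boolean mask created by the sub, abs and le operations on the cumsum series, then look up the value with .at (df.get_value, used for this lookup in older code, was removed in pandas 1.0):

df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']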
4
