idxmax() doesn't work on SeriesGroupBy that contains NaN

Views: 84 · Published: 2020/5/24 4:24:09 · Tags: python, pandas

Problem description

Here is my code:

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
income = DataFrame({'name': ['Adam', 'Bill', 'Chris', 'Dave', 'Edison', 'Frank'],
                    'age': [22, 24, 31, 45, 51, 55],
                    'income': [1000, 2500, 1200, 1500, 1300, 1600],
                    })
# Bin ages into (20, 30], (30, 40], (40, 50], (50, 60]
ageBin = pd.cut(income.age, [20, 30, 40, 50, 60])
grouped = income.groupby([ageBin])
# Look up the row with the highest income in each bin
# (.loc is used here because .ix has been removed from pandas)
highestIncome = income.loc[grouped.income.idxmax()]

I have a DataFrame that contains names, ages and incomes as follows:

index   age income  name
0   22  1000    Adam
1   24  2500    Bill
2   31  1200    Chris
3   45  1500    Dave
4   51  1300    Edison
5   55  1600    Frank

I would like to group the data by age bin and collect the record with the highest income in each bin. The code above works, and highestIncome is:

index   age income  name
1   24  2500    Bill
2   31  1200    Chris
3   45  1500    Dave
5   55  1600    Frank

However, if I delete Chris's record, so that no record falls within the age range (30, 40], I get a ValueError at grouped.income.idxmax(). I think this is caused by the NaN in grouped, but I cannot find a way to solve the problem. Any input is appreciated.
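One possible way around the empty bin, offered only as a sketch and not as the accepted answer to this question, is to ask groupby to skip bins that contain no rows. This assumes a pandas version whose groupby accepts the observed parameter; the snake_case names below are illustrative and not part of the original code.

```python
import pandas as pd

# Same data as in the question, with Chris's record removed,
# so the (30, 40] bin is empty.
income = pd.DataFrame({
    'name': ['Adam', 'Bill', 'Dave', 'Edison', 'Frank'],
    'age': [22, 24, 45, 51, 55],
    'income': [1000, 2500, 1500, 1300, 1600],
})
age_bin = pd.cut(income['age'], [20, 30, 40, 50, 60])

# Assumption: observed=True is available; it drops bins with no rows,
# so idxmax() never sees an empty group.
grouped = income.groupby(age_bin, observed=True)
highest_income = income.loc[grouped['income'].idxmax()]
print(highest_income)
```

With the empty bin skipped, the result holds one row per populated bin (Bill, Dave and Frank here).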

Update: Thanks a lot for the answers. I do believe this is a bug in idxmax() for groupby objects. I would like to go with the agg(lambda x: x.idxmax()) approach, since I compared the speed of sorting versus agg(lambda x: x.idxmax()) on a synthetic data set of 10 million rows. Here is the code and the output:

import time

import numpy as np
import pandas as pd

testData = pd.DataFrame({'key': np.random.randn(10000000),
                         'value': np.random.randn(10000000)})
keyBin = pd.cut(testData.key, 1000)

# Approach 1: sort by value, then take the first row of each bin
start = time.time()
grouped1 = testData.sort_values('value', ascending=False).groupby(keyBin, observed=False)
highestValues1 = testData.loc[grouped1.head(1).index]
end = time.time()
print(end - start)

# Approach 2: idxmax per bin via agg; empty bins yield NaN and are dropped
start = time.time()
grouped2 = testData.groupby(keyBin, observed=False)
idx = grouped2.value.agg(lambda x: x.idxmax() if len(x) else np.nan).dropna()
highestValues2 = testData.loc[idx.astype(int)]
end = time.time()
print(end - start)

# Validation: both approaches should select the same rows
(highestValues1.sort_index() == highestValues2.sort_index()).all()

Output:

5.30953717232
1.0279238224

Out[47]:

key      True
value    True
dtype: bool
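The agg-based approach from the update can also be sketched on the small income example with Chris removed. This is a hedged illustration, not the author's exact code: the helper safe_idxmax and the snake_case names are hypothetical, and the guard for empty groups is an assumption added so that the sketch does not depend on whether pandas passes empty bins to the function.

```python
import numpy as np
import pandas as pd

income = pd.DataFrame({
    'name': ['Adam', 'Bill', 'Dave', 'Edison', 'Frank'],
    'age': [22, 24, 45, 51, 55],
    'income': [1000, 2500, 1500, 1300, 1600],
})
age_bin = pd.cut(income['age'], [20, 30, 40, 50, 60])


def safe_idxmax(s):
    """Hypothetical helper: index label of the max, or NaN for an empty group."""
    return s.idxmax() if len(s) else np.nan


# Keep all bins (observed=False), so the empty (30, 40] bin shows up as NaN.
per_bin = income.groupby(age_bin, observed=False)['income'].agg(safe_idxmax)

# Drop the NaN from the empty bin and restore integer labels before lookup.
labels = per_bin.dropna().astype(int)
highest_income = income.loc[labels]
print(highest_income)
```

Dropping the NaN labels before the lookup plays the role that dropna(how='all') played in the original benchmark.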
