idxmax() doesn't work on SeriesGroupBy that contains NaN
Question
Here is my code:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
income = DataFrame({'name': ['Adam', 'Bill', 'Chris', 'Dave', 'Edison', 'Frank'],
                    'age': [22, 24, 31, 45, 51, 55],
                    'income': [1000, 2500, 1200, 1500, 1300, 1600],
                    })
ageBin = pd.cut(income.age, [20, 30, 40, 50, 60])
grouped = income.groupby([ageBin])
highestIncome = income.ix[grouped.income.idxmax()]
I have a DataFrame that contains names, ages and incomes, as follows:
index  age  income  name
0       22    1000  Adam
1       24    2500  Bill
2       31    1200  Chris
3       45    1500  Dave
4       51    1300  Edison
5       55    1600  Frank
I would like to group the data by the age bins and collect the record with the highest income in each bin. The code above works, and highestIncome is:
index  age  income  name
1       24    2500  Bill
2       31    1200  Chris
3       45    1500  Dave
5       55    1600  Frank
However, if I delete Chris's record, so that no record falls within the age range (30, 40], I get a ValueError at grouped.income.idxmax(). I think this is because of the NaN in grouped, but I cannot find a way to solve the problem. Any input is appreciated.
Update: Thanks a lot for the answers. I do believe this is a bug in idxmax() for groupby objects. I would like to go with the agg(lambda x: x.idxmax()) approach, because I compared the speed of sort() against agg(lambda x: x.idxmax()) on a synthetic data set of 10 million rows. Here is the code and the output:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import time
testData = DataFrame({'key': np.random.randn(10000000),
                      'value': np.random.randn(10000000)})
keyBin = pd.cut(testData.key, 1000)
start = time.time()
grouped1 = testData.sort('value', ascending=False).groupby([keyBin])
highestValues1 = testData.ix[grouped1.head(1).index]
end = time.time()
print end - start
start = time.time()
grouped2 = testData.groupby([keyBin])
highestValues2 = testData.ix[grouped2.value.agg(lambda x: x.idxmax())].dropna(how='all')
end = time.time()
print end - start
#validation
(highestValues1.sort() == highestValues2.sort()).all()
Output:
5.30953717232
1.0279238224
Out[47]:
key True
value True
dtype: bool
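For readers on current pandas, where DataFrame.sort and the .ix indexer are no longer available and print is a function, the comparison above can be approximated as follows. This is a hedged sketch, not the original benchmark: it assumes 100,000 rows instead of 10 million, stores the bins in a hypothetical 'bin' column, passes observed=True so that empty bins are skipped instead of producing NaN, and calls idxmax() directly rather than through agg. Timings vary by machine and pandas version, so no output is shown.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumption: far smaller than the 10 million rows used above

test_data = pd.DataFrame({'key': rng.standard_normal(n),
                          'value': rng.standard_normal(n)})
# Hypothetical helper column holding the bin of each row.
test_data['bin'] = pd.cut(test_data['key'], 1000)

# Approach 1: sort by value, then keep the first row of every bin.
start = time.perf_counter()
highest1 = (test_data.sort_values('value', ascending=False)
            .groupby('bin', observed=True)
            .head(1))
print(time.perf_counter() - start)

# Approach 2: ask each bin for the index label of its largest value.
start = time.perf_counter()
idx = test_data.groupby('bin', observed=True)['value'].idxmax()
highest2 = test_data.loc[idx.values]
print(time.perf_counter() - start)

# Validation: both approaches should select the same rows.
same_rows = highest1.sort_index().equals(highest2.sort_index())
print(same_rows)
```

Because observed=True drops empty bins before idxmax() runs, no dropna() step is needed afterwards in this variant.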
Recommended answer
result = grouped['income'].agg(lambda x: x.idxmax())
result
Out[]:
age
(20, 30] 1
(30, 40] NaN
(40, 50] 2
(50, 60] 4
Name: income, dtype: float64
You can then do the following to get the data:
income.ix[result.values].dropna()
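The .ix indexer used above was removed in pandas 1.0, so the snippet does not run as posted on newer releases. A minimal sketch of one alternative, assuming a pandas version whose groupby accepts the observed argument: group only on the bins that actually contain rows, so idxmax() never sees an empty group, and select the rows with .loc. The names age_bin, idx and highest_income are illustrative, and this is a different route from the agg-based approach above rather than a restatement of it.

```python
import pandas as pd

# Same data as in the question, with Chris's row removed so that the
# (30, 40] bin is empty.
income = pd.DataFrame({
    'name': ['Adam', 'Bill', 'Dave', 'Edison', 'Frank'],
    'age': [22, 24, 45, 51, 55],
    'income': [1000, 2500, 1500, 1300, 1600],
})

age_bin = pd.cut(income['age'], [20, 30, 40, 50, 60])

# observed=True groups only by bins that contain at least one row,
# so no group is empty when idxmax() is evaluated.
idx = income.groupby(age_bin, observed=True)['income'].idxmax()

# .loc does label-based selection, which is what .ix was used for here.
highest_income = income.loc[idx.values]
print(highest_income)
```

With this data the selected rows are those of Bill, Dave and Frank, matching the index labels 1, 2 and 4 in the output above.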