How to fix Index Error due to level in Pandas Groupby
Problem description
I have the following DataFrame, badges. The UserId column contains multiple entries for the same user. For a given BadgeName, I want to obtain the minimum value of Date for every UserId. I wrote a function, user_badge_dt, to do this, but I get an IndexError. Note that although the dataset is the same for all users, I get this error only for some badges and not for others. I don't know why this is happening.
Part of the badges DataFrame
   UserId  BadgeName       Date
0  23      Curious         2016-01-12T18:44:49.267
1  22      Autobiographer  2017-01-12T18:44:49.267
2  23      Curious         2018-01-12T18:44:49.267
3  20      Autobiographer  2019-01-12T18:44:49.267
4  22      Autobiographer  2020-01-12T18:44:49.267
5  30      Curious         2020-01-12T18:44:49.267
Function
# Function to obtain each UserId with the date-time at which the given badge was first obtained
def user_badge_dt(badge_name):
    # Create a DataFrame with every UserId and date-time for the given badge
    df = badges[['UserId','Date']].loc[badges.BadgeName == badge_name]
    # Obtain the first date-time of badge attainment
    v = df.groupby("UserId", group_keys=False)['Date'].nsmallest(1)
    v.index = v.index.droplevel(1)
    df['date'] = df['UserId'].map(v)
    df.drop(columns='Date', inplace=True)
    # Remove all duplicate users
    df.drop_duplicates(subset='UserId', inplace=True)
    return df
Error
IndexError: Too many levels: Index has only 1 level, not 2
Note
On further inspection, I found that the error is raised on the line v.index = v.index.droplevel(1)
This is because the preceding line gives different results for different badge names:
CASE 1: When the code works correctly for the given badge
df = badges[['UserId','Date']].loc[badges.BadgeName == 'Autobiographer']
v = df.groupby("UserId", group_keys=False)['Date'].nsmallest(1)
print(v)
Output:
1 22 2017-01-12T18:44:49.267
3 20 2019-01-12T18:44:49.267
(This output has the index, the UserId, and the minimum value of Date for the given badge.)
CASE 2: When the code fails for the given badge
df = badges[['UserId','Date']].loc[badges.BadgeName == 'Curious']
v = df.groupby("UserId", group_keys=False)['Date'].nsmallest(1)
print(v)
Output:
23 2016-01-12T18:44:49.267
30 2020-01-12T18:44:49.267
(This output does not have the index, which is why the code fails on the next line. I don't know why this happens.)
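The failure mode described above can be sketched in isolation. The two Series below are hypothetical stand-ins for the two shapes of v (they are not the asker's real data), and drop_inner_level is an illustrative helper name, not part of pandas. The sketch only shows that droplevel(1) needs an index with at least two levels, and that checking index.nlevels first avoids the exception.

```python
import pandas as pd

# Hypothetical stand-in for CASE 1: a two-level index (UserId, original row label).
two_level = pd.Series(
    ["2017-01-12T18:44:49.267", "2019-01-12T18:44:49.267"],
    index=pd.MultiIndex.from_tuples([(22, 1), (20, 3)], names=["UserId", None]),
)

# Hypothetical stand-in for CASE 2: a plain single-level index.
one_level = pd.Series(
    ["2016-01-12T18:44:49.267", "2020-01-12T18:44:49.267"],
    index=[23, 30],
)

def drop_inner_level(s):
    """Drop index level 1 only if the index actually has more than one level."""
    if s.index.nlevels > 1:
        s = s.copy()
        s.index = s.index.droplevel(1)
    return s

# Calling droplevel(1) directly on a single-level index raises IndexError.
error_message = None
try:
    one_level.index.droplevel(1)
except IndexError as exc:
    error_message = str(exc)
print(error_message)

# The guarded helper handles both shapes without raising.
safe_two = drop_inner_level(two_level)
safe_one = drop_inner_level(one_level)
print(safe_two.index.nlevels, safe_one.index.nlevels)
```

A guard like this only hides the symptom; it does not explain why the groupby result has a different index shape for different badges.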
For any input badge_name, the function should return a DataFrame with the UserId and the minimum value of Date for the given badge. If my function is unclear, please suggest a different way to achieve this with a new function.
Recommended answer
I cannot reproduce your error, but I think your solution can be simplified with DataFrame.sort_values followed by drop_duplicates, which keeps the first row, and therefore the earliest date, for each user:
import pandas as pd

# Parse the date strings so that sorting is chronological
badges['Date'] = pd.to_datetime(badges['Date'])

def user_badge_dt(badge_name):
    # Keep the rows for the given badge, sort by date, then keep each user's earliest row
    return (badges.loc[badges.BadgeName == badge_name, ['UserId','Date']]
                  .sort_values('Date')
                  .drop_duplicates(subset='UserId'))
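As a quick check, the function can be run against the six sample rows shown in the question. The sketch below rebuilds that excerpt by hand, so it assumes the column names UserId, BadgeName and Date exactly as printed there. It also includes user_badge_dt_min, a hypothetical alternative that is not part of the answer: it takes the per-user minimum directly with groupby and min, which should give the same dates without any index-level handling.

```python
import pandas as pd

# Sample rows copied from the excerpt of `badges` in the question.
badges = pd.DataFrame({
    "UserId": [23, 22, 23, 20, 22, 30],
    "BadgeName": ["Curious", "Autobiographer", "Curious",
                  "Autobiographer", "Autobiographer", "Curious"],
    "Date": ["2016-01-12T18:44:49.267", "2017-01-12T18:44:49.267",
             "2018-01-12T18:44:49.267", "2019-01-12T18:44:49.267",
             "2020-01-12T18:44:49.267", "2020-01-12T18:44:49.267"],
})
badges["Date"] = pd.to_datetime(badges["Date"])

def user_badge_dt(badge_name):
    # The answer's approach: filter, sort by date, keep each user's first row.
    return (badges.loc[badges.BadgeName == badge_name, ["UserId", "Date"]]
                  .sort_values("Date")
                  .drop_duplicates(subset="UserId"))

def user_badge_dt_min(badge_name):
    # Hypothetical alternative: per-user minimum date via groupby.
    subset = badges.loc[badges["BadgeName"] == badge_name, ["UserId", "Date"]]
    return subset.groupby("UserId", as_index=False)["Date"].min()

curious = user_badge_dt("Curious")
curious_min = user_badge_dt_min("Curious")
autobiographer = user_badge_dt("Autobiographer")

print(curious)
print(curious_min)
print(autobiographer)
```

The two functions differ in row order and index: the sort-based version keeps the original row labels ordered by date, while the groupby version returns a fresh index ordered by UserId.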