How to identify and set a column value for only the last occurrence of a duplicate row
Problem description
I am very new to Pandas and Python, so pardon me if this is a basic question. In an effort to solve my problem (load multiple csv files, look for missing merchandiseID in subsequent files, calculate the date sold based on it), I made some changes to how I clean these files. I have the following columns in the data frame loaded from multiple csv files.
store_id stock_number merchandise_id date_acquired color price MSRP csv_date
12973 7382 UISN78008 04/11/2017 Red $3200 $3650 01/31/2017
45973 9889 YHAN79807 08/09/2017 White $3600 $3650 01/31/2017
...
45973 9889 YHAN79807 08/09/2017 White $3600 $3650 03/31/2017
The last row is the last occurrence of the item with merchandise_id 'YHAN79807'. I was able to find the last occurrence by following "How to identify the first occurence of duplicate rows in Python pandas Dataframe" and modifying it a bit. I used
df1['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1])
However, I want to set this value in the 'dup_index' column only for the last occurrence of 'YHAN79807' as merchandiseID. I do not want the other duplicated rows for 'YHAN79807' to have this value. They should be blank. Only the last occurrence should have this ID. I have not been able to do that yet. I tried a few things; one was:
group = df1.groupby(['merchandiseID'])
df1_index = df1.set_index(['merchandiseID'])
df1[ (((len(group.indices[ind])-1)==group.indices[df1.merchandiseID])]['dup_index'] = 'succeed'
I tried assigning 'succeed' as a first step to see whether the column comparison would give me a result, but it gave me the following error:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = getattr(x, name)(y) ... raise TypeError('Could not compare %s type with Series' %
I am at my wits' end. What am I missing? Any pointers are appreciated.
Best,
Alice
Recommended answer
I think you need:
g = df.groupby(['merchandise_id'])
df1 = df.set_index(['merchandise_id'])
df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1])
print (df)
store_id stock_number merchandise_id date_acquired color price MSRP \
0 12973 7382 UISN78008 04/11/2017 Red $3200 $3650
1 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
2 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
csv_date dup_index
0 01/31/2017 0
1 01/31/2017 2
2 03/31/2017 2
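The same per-group "last row" value can probably also be computed without a Python-level lambda. The following is only a minimal sketch, assuming a default RangeIndex (so row labels equal positions) and a small made-up frame that mirrors the sample rows; the name row_labels is illustrative and not from the original answer.

```python
import pandas as pd

# Hypothetical sample frame mirroring the rows shown above.
df = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "03/31/2017"],
})

# Put the row labels in a Series, group them by merchandise_id,
# and broadcast each group's last label back to all of its rows.
row_labels = pd.Series(df.index, index=df.index)
df["dup_index"] = row_labels.groupby(df["merchandise_id"]).transform("last")

print(df)
```

As with the lambda version, every row of a group receives the label of that group's last row, so this does not yet blank out the earlier duplicates.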
Or, if you need to identify only the last of the duplicated rows, combine two conditions with &:
print (df)
store_id stock_number merchandise_id date_acquired color price MSRP \
0 12973 7382 UISN78008 04/11/2017 Red $3200 $3650
1 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
2 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
3 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
csv_date
0 01/31/2017
1 01/31/2017
2 01/31/2017
3 03/31/2017
m1 = ~df.duplicated(['merchandise_id'], keep='last')
m2 = df.duplicated(['merchandise_id'], keep=False)
m = m1 & m2
df.loc[m, 'new'] = 'succeed'
print (df)
store_id stock_number merchandise_id date_acquired color price MSRP \
0 12973 7382 UISN78008 04/11/2017 Red $3200 $3650
1 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
2 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
3 45973 9889 YHAN79807 08/09/2017 White $3600 $3650
csv_date new
0 01/31/2017 NaN
1 01/31/2017 NaN
2 01/31/2017 NaN
3 03/31/2017 succeed
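To get closer to what the question asks for (the index value on the last occurrence only, blanks elsewhere), the two masks can be combined with the row labels. This is a sketch under assumptions: the frame below is a cut-down, made-up version of the sample data, and the names is_last and has_dups are illustrative.

```python
import pandas as pd

# Hypothetical sample frame mirroring the four rows shown above.
df = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "01/31/2017", "03/31/2017"],
})

# True for the final row of every merchandise_id (including ids seen once).
is_last = ~df.duplicated(["merchandise_id"], keep="last")
# True for every row whose merchandise_id appears more than once.
has_dups = df.duplicated(["merchandise_id"], keep=False)
m = is_last & has_dups

# Keep the row's own label where the mask holds; all other rows become NaN.
df["dup_index"] = pd.Series(df.index, index=df.index).where(m)

print(df)
```

Because the non-matching rows hold NaN, the column comes out as float; the second mask is what keeps 'UISN78008', which appears only once, from being flagged.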