How to identify and set a column value for only the last occurrence of a duplicate row


Problem description

I am very new to Pandas and Python, so pardon me if this is a basic question. To solve my problem (load multiple csv files, look for a merchandiseID that is missing from subsequent files, and calculate the date sold from that), I made some changes to how I clean these files. The data frame loaded from the csv files has the following columns.

store_id stock_number merchandise_id date_acquired color price MSRP csv_date
12973     7382        UISN78008     04/11/2017    Red  $3200 $3650  01/31/2017
45973     9889        YHAN79807     08/09/2017   White $3600 $3650  01/31/2017
...
45973     9889        YHAN79807     08/09/2017   White $3600 $3650  03/31/2017

The last row is the last occurrence of the item with merchandise_id 'YHAN79807'. I was able to find the last occurrence by following "How to identify the first occurence of duplicate rows in Python pandas Dataframe" and modifying it a bit. I used

 df1['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1])

However, I want to set this value in the 'dup_index' column only for the last occurrence of 'YHAN79807' as merchandiseID. The other rows that duplicate 'YHAN79807' should not get this value; they should stay blank. Only the last occurrence should have it. I have not been able to do that yet. I tried a few things, one of which was:

group = df1.groupby(['merchandiseID'])
df1_index = df1.set_index(['merchandiseID'])
df1[ (((len(group.indices[ind])-1)==group.indices[df1.merchandiseID])]['dup_index'] = 'succeed'

I tried assigning 'succeed' as a first step, to see whether the column comparison would give me a result, but it produced the following error:

 FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

result = getattr(x, name)(y) ... raise TypeError('Could not compare %s type with Series' %

I am at my wits' end. What am I missing? Any pointers are appreciated.

Best,

Alice

Recommended answer

I think you need:

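# g.indices maps each merchandise_id to the row positions it occupies
g = df.groupby('merchandise_id')
df1 = df.set_index('merchandise_id')
# For every row, look up the last position at which its merchandise_id occurs
df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][-1])
print (df)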
   store_id  stock_number merchandise_id date_acquired  color  price   MSRP  \
0     12973          7382      UISN78008    04/11/2017    Red  $3200  $3650   
1     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   
2     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   

     csv_date  dup_index  
0  01/31/2017          0  
1  01/31/2017          2  
2  03/31/2017          2  
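
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the rows shown in the question (an assumption, since the original csv files are not available), and it relies on the default RangeIndex so that the positions stored in g.indices coincide with the index labels.

```python
import pandas as pd

# Sample rows copied from the question; the real data comes from several csv files.
df = pd.DataFrame({
    'store_id': [12973, 45973, 45973],
    'stock_number': [7382, 9889, 9889],
    'merchandise_id': ['UISN78008', 'YHAN79807', 'YHAN79807'],
    'date_acquired': ['04/11/2017', '08/09/2017', '08/09/2017'],
    'color': ['Red', 'White', 'White'],
    'price': ['$3200', '$3600', '$3600'],
    'MSRP': ['$3650', '$3650', '$3650'],
    'csv_date': ['01/31/2017', '01/31/2017', '03/31/2017'],
})

g = df.groupby('merchandise_id')

# g.indices is a dict: merchandise_id -> array of integer row positions.
# Taking element [-1] gives the position of the last occurrence of that id.
df['dup_index'] = df['merchandise_id'].map(lambda mid: g.indices[mid][-1])

print(df[['merchandise_id', 'csv_date', 'dup_index']])
```

Note that every row of a duplicated group receives the same value here, which is why the second approach is needed when only the last row should be marked.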

Or, if you need to flag only the last row of each duplicated group, combine two conditions with &:

print (df)
   store_id  stock_number merchandise_id date_acquired  color  price   MSRP  \
0     12973          7382      UISN78008    04/11/2017    Red  $3200  $3650   
1     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   
2     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   
3     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   

     csv_date  
0  01/31/2017  
1  01/31/2017  
2  01/31/2017  
3  03/31/2017  


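# m1: True for the last occurrence of each merchandise_id (unique ids included)
m1 = ~df.duplicated(['merchandise_id'], keep='last')
# m2: True for every row whose merchandise_id appears more than once
m2 = df.duplicated(['merchandise_id'], keep=False)
# Keep only rows that are both duplicated and the last of their group
m = m1 & m2
df.loc[m, 'new'] = 'succeed'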
print (df)
   store_id  stock_number merchandise_id date_acquired  color  price   MSRP  \
0     12973          7382      UISN78008    04/11/2017    Red  $3200  $3650   
1     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   
2     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   
3     45973          9889      YHAN79807    08/09/2017  White  $3600  $3650   

     csv_date      new  
0  01/31/2017      NaN  
1  01/31/2017      NaN  
2  01/31/2017      NaN  
3  03/31/2017  succeed  
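
If the goal is to store the row's index in 'dup_index' rather than the string 'succeed', the same mask can be reused. The sketch below is one possible way to combine the two ideas above and is not part of the original answer; the trimmed four-row sample frame is an assumption based on the question, and the default RangeIndex is assumed.

```python
import pandas as pd

# Trimmed sample: only the columns that matter for the duplicate check.
df = pd.DataFrame({
    'merchandise_id': ['UISN78008', 'YHAN79807', 'YHAN79807', 'YHAN79807'],
    'csv_date': ['01/31/2017', '01/31/2017', '01/31/2017', '03/31/2017'],
})

# Last occurrence of each id (rows with a unique id also count as "last").
is_last = ~df.duplicated('merchandise_id', keep='last')
# Every row whose id appears more than once.
is_dup = df.duplicated('merchandise_id', keep=False)
mask = is_last & is_dup

# Keep the index label where the mask is True; everything else becomes NaN.
df['dup_index'] = df.index.to_series().where(mask)

print(df)
```

Because the unmarked rows hold NaN, the column comes out as float; whether that matters depends on how 'dup_index' is used afterwards.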
