How to assign a unique ID to detect repeated rows in a pandas dataframe?


Question

I am working with a large pandas dataframe with several columns, much like this:

A      B         C    D   

John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4 
Lisa   John      5    0
Homer  Bart      2    3
Lisa   John      5    0
Homer  Bart      2    3
Homer  Bart      2    3
Tom    Maggie    1    4

How can I assign a unique ID to each group of repeated rows? For example:

A      B         C    D      new_id

John   Tom       0    1.2      1
Homer  Bart      2    3.0      2
Tom    Maggie    1    4.2      3
Lisa   John      5    0        4
Homer  Bart      2    3        5
Lisa   John      5    0        4
Homer  Bart      2    3.0      2
Homer  Bart      2    3.0      2
Tom    Maggie    1    4.1      6

I know that I can use duplicated to detect the duplicated rows; however, I cannot see which rows are repeats of one another. I tried:

df.assign(id=(df.columns).astype('category').cat.codes)
df

However, this does not work. How can I get a unique ID that identifies each group of duplicated rows?

Recommended answer

For small dataframes, you can convert each row to a tuple, which is hashable, and then use pd.factorize:

df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
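
As a minimal, self-contained sketch of the factorize approach, the snippet below rebuilds a small frame modeled on the first table in the question. Using the integer D values from that table is an assumption, since the two tables in the question differ, and the name row_keys is only illustrative.

```python
import pandas as pd

# Sample data modeled on the first table in the question (assumed integer C and D).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Turn each row into a hashable tuple so that identical rows compare equal.
row_keys = df.apply(tuple, axis=1)

# factorize labels each distinct tuple with an integer code, in order of
# first appearance; codes start at 0, so add 1 to start the IDs at 1.
df["new_id"] = pd.factorize(row_keys)[0] + 1

print(df)
```

Rows that are identical in every column end up with the same new_id, so filtering on one ID returns every repeat of that row.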

For larger dataframes, groupby with ngroup is more efficient:

df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
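
The groupby variant can be sketched the same way. The sample data is again modeled on the first table in the question (assumed integer D values), and key_cols is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

# Sample data modeled on the first table in the question (assumed integer C and D).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Capture the key columns before adding new_id, so that rerunning the
# assignment later does not group on the ID column itself.
key_cols = df.columns.tolist()

# ngroup() numbers each group from 0; sort=False keeps groups in the order
# they first appear instead of sorting by key values. Add 1 to start at 1.
df["new_id"] = df.groupby(key_cols, sort=False).ngroup() + 1

# Optional: keep only the rows that occur more than once.
repeated = df[df.duplicated(subset=key_cols, keep=False)]

print(df)
print(repeated)
```

Grouping on an explicit list of columns also makes it easy to define "duplicate" on a subset of columns, for example only A and B, by shortening key_cols.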
