Q: [Pandas] How to efficiently assign a unique ID to individuals with multiple entries based on name in a very large df

Problem description

I'd like to take a dataset containing many distinct individuals, each with multiple entries, and assign each individual a single unique id shared by all of their entries. Here's an example of the df:

      FirstName LastName  id
0     Tom       Jones     1
1     Tom       Jones     1
2     David     Smith     1
3     Alex      Thompson  1
4     Alex      Thompson  1

So, basically, I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.

I already have one solution: a dead simple Python loop that tracks two values (one for the id, one for the row index) and assigns each row an id based on whether its name matches that of the previous row:

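x = 1  # id to assign to the current individual
i = 1  # row index; row 0 keeps its initial id of 1

while i < len(df_test):
    # Same name as the previous row: same individual, reuse the current id.
    if ((df_test.LastName[i] == df_test.LastName[i-1]) and
            (df_test.FirstName[i] == df_test.FirstName[i-1])):
        df_test.loc[i, 'id'] = x
        i = i+1
    else:
        # Different name: start a new id.
        x = x+1
        df_test.loc[i, 'id'] = x
        i = i+1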

The problem I'm running into is that the dataframe has about 9 million entries, so this loop would take a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multi-indexing as potential solutions, but haven't found the right approach yet. Thanks!

Recommended answer

You could join the last name and first name, convert the result to a category, and then take the category codes.

Of course, different people who share the same name would get the same id. Note also that category codes start at 0, not 1.

df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
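
Since the question mentions groupby, here is a minimal sketch of two other vectorized options, not part of the original answer. It assumes the same column names as the example; the sample frame and the `run_id` column are made up for illustration. The first numbers each (LastName, FirstName) pair with `groupby(...).ngroup()`, using `sort=False` so ids follow order of first appearance and adding 1 so they start at 1. The second reproduces the logic of the question's loop, which only works when all entries for a person are adjacent.

```python
import pandas as pd

# Hypothetical sample data mirroring the example in the question.
df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# Option 1: number each (LastName, FirstName) pair.
# sort=False numbers groups in order of first appearance;
# ngroup() is 0-based, so add 1 to start the ids at 1.
df["id"] = df.groupby(["LastName", "FirstName"], sort=False).ngroup() + 1

# Option 2: vectorized form of the question's loop.
# A new id starts whenever a row's name differs from the previous row's,
# so this assumes entries for the same person are adjacent.
changed = (df["LastName"] != df["LastName"].shift()) | (
    df["FirstName"] != df["FirstName"].shift()
)
df["run_id"] = changed.cumsum()

print(df)
```

Both options avoid the Python-level loop and the string concatenation. As with the category approach, different people who share the same name are not distinguished.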
