Q: [Pandas] How to efficiently assign a unique ID to individuals with multiple entries based on name in a very large df

Problem description

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:

      FirstName LastName  id
0     Tom       Jones     1
1     Tom       Jones     1
2     David     Smith     1
3     Alex      Thompson  1
4     Alex      Thompson  1

So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.

I already have one solution: a dead-simple Python loop that tracks two values (one for the id, one for the row index) and assigns each individual an id based on whether they match the previous individual:

x = 1  # current id; row 0 keeps its initial id of 1
i = 1  # current row position

while i < len(df_test):
    # Move to the next id whenever the name differs from the previous row
    if not (df_test.LastName[i] == df_test.LastName[i-1]
            and df_test.FirstName[i] == df_test.FirstName[i-1]):
        x = x + 1
    df_test.loc[i, 'id'] = x
    i = i + 1

The problem I'm running into is that the dataframe has about 9 million entries, so with that loop it would have taken a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right solution yet. Thanks!
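
For comparison, the same "did the name change from the previous row" logic can be written without a Python-level loop. This is only a minimal sketch under assumptions: the sample frame is rebuilt from the example above, and, like the loop, it relies on all rows for one individual being adjacent.

```python
import pandas as pd

# Sample data rebuilt from the example above (assumption: real data has the same columns)
df_test = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# True wherever the name differs from the previous row.
# Row 0 is compared against NaN, so it is always True.
changed = (
    (df_test["LastName"] != df_test["LastName"].shift())
    | (df_test["FirstName"] != df_test["FirstName"].shift())
)

# Running count of changes: each new run of identical names gets the next id, starting at 1
df_test["id"] = changed.cumsum()

print(df_test)
```

If the rows are not already ordered by name, the same person appearing in two separate runs would receive two different ids, just as with the loop.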

Recommended answer

You could join the last name and first name, convert it to a category, and then get the codes.

Of course, multiple people with the same name would have the same id.

>>> df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
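
Note that the category codes start at 0 and follow the sorted order of the joined names. If the ids should instead start at 1 and follow the order in which each name first appears, as in the question, one alternative is `groupby(...).ngroup()`. The following is a minimal sketch rather than part of the original answer; the sample frame is rebuilt from the question's example, and `sort=False` is a choice made here so that groups are numbered by first appearance.

```python
import pandas as pd

# Sample data rebuilt from the question's example (assumption: real data has the same columns)
df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# ngroup() labels each (LastName, FirstName) group with an integer starting at 0.
# sort=False numbers groups in order of first appearance; + 1 makes the ids 1-based.
df["id"] = df.groupby(["LastName", "FirstName"], sort=False).ngroup() + 1

print(df)
```

This groups on the two columns directly, so no joined string key is built, and it does not require the rows for one individual to be adjacent. The same caveat applies: different people who share a first and last name still get the same id.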
