Q: [Pandas] How to efficiently assign a unique ID to individuals with multiple entries based on name in a very large df
Problem description
I'd like to take a dataset containing many unique individuals, each with multiple entries, and assign each individual a unique id shared by all of their entries. Here's an example of the df:
  FirstName  LastName  id
0       Tom     Jones   1
1       Tom     Jones   1
2     David     Smith   1
3      Alex  Thompson   1
4      Alex  Thompson   1
So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.
I already have one solution: a dead simple Python loop that tracks two values (one for the id, one for the index) and assigns each row an id based on whether it matches the previous individual:
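x = 1  # current id
i = 1  # current row; row 0 keeps its initial id of 1
while i < len(df_test):
    # Same person as the previous row: reuse the current id.
    if (df_test.LastName[i] == df_test.LastName[i-1] and
            df_test.FirstName[i] == df_test.FirstName[i-1]):
        df_test.loc[i, 'id'] = x
    else:
        # New person: move on to the next id.
        x = x + 1
        df_test.loc[i, 'id'] = x
    i = i + 1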
The problem is that the dataframe has about 9 million entries, so this loop would take a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't found the right one yet. Thanks!
Recommended answer
You could join the last name and first name, convert the result to a category, and then get the codes. The codes start at 0, so add 1 if the ids should start at 1.
Of course, multiple people with the same name would have the same id.
df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
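Since the question mentions groupby, here is a minimal sketch of one alternative that avoids building the joined string: group on both name columns and number the groups with `ngroup()`. The sample frame below is an assumption that mirrors the example in the question, and `sort=False` is a choice made here so that ids follow the order in which names first appear. As with the category approach, two different people with the same name would still share an id.

```python
import pandas as pd

# Sample data mirroring the example in the question (assumed for illustration).
df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# ngroup() numbers each (LastName, FirstName) group starting at 0.
# sort=False numbers groups in order of first appearance;
# adding 1 makes the ids start at 1, as the question asks.
df["id"] = df.groupby(["LastName", "FirstName"], sort=False).ngroup() + 1

print(df)
```

Unlike the loop in the question, this assigns the same id to every row with a given name even when those rows are not adjacent, so the frame does not need to be sorted by name first.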