Replace other columns of duplicate rows with first unique value and create lookup
Question
Here is the data:
Account_Number Dummy_Account
1050080713252 ACC0000000000001
1050223213427 ACC0000000000002
1050080713252 ACC0000000169532
1105113502309 ACC0000000123005
1100043521537 ACC0000000000004
1100045301840 ACC0000000000005
1105113502309 ACC0000000000040
Rows 1 and 3 have duplicate values in Account_Number, and so do rows 4 and 7. I need to replace the duplicate values in Account_Number with the same value from Dummy_Account. So for 1050080713252, both rows 1 and 3 should get the same dummy value, ACC0000000000001. But instead of replacing directly, I want to keep the original mapping.
My expected output is:
Account_Number_Map Dummy_Account_Original
ACC0000000000001 ACC0000000000001
ACC0000000000002 ACC0000000000002
ACC0000000000001 ACC0000000169532
ACC0000000123005 ACC0000000123005
ACC0000000000004 ACC0000000000004
ACC0000000000005 ACC0000000000005
ACC0000000123005 ACC0000000000040
Since ACC0000000169532 is a duplicate Dummy_Account with respect to Account_Number, I want to create a lookup that replaces it with ACC0000000000001.
My attempt
I started by creating a dict like this:
maps = dict(zip(df.Dummy_Account, df.Account_Number))
I figured I would create a dict that has the original Dummy_Account values as keys and the new Dummy_Account values as values, but I am a little lost. My dataset is large, so I am also looking for optimized solutions.
Recommended answer
Using drop_duplicates, create a Series which you'll pass to map:
m = df.drop_duplicates('Account_Number', keep='first')\
.set_index('Account_Number')\
.Dummy_Account
df.Account_Number = df.Account_Number.map(m)
df
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
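The question also asks for a lookup from each original Dummy_Account to its replacement. A minimal sketch of one way to build it on top of the Series above, assuming the sample data from the question; the names first_dummy, result and lookup are illustrative, and the output column names are taken from the expected output in the question:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Account_Number": [
        1050080713252, 1050223213427, 1050080713252, 1105113502309,
        1100043521537, 1100045301840, 1105113502309,
    ],
    "Dummy_Account": [
        "ACC0000000000001", "ACC0000000000002", "ACC0000000169532",
        "ACC0000000123005", "ACC0000000000004", "ACC0000000000005",
        "ACC0000000000040",
    ],
})

# First Dummy_Account seen for each Account_Number.
first_dummy = (
    df.drop_duplicates("Account_Number", keep="first")
      .set_index("Account_Number")["Dummy_Account"]
)

# Build a new frame instead of overwriting df, so the original mapping is kept.
result = pd.DataFrame({
    "Account_Number_Map": df["Account_Number"].map(first_dummy),
    "Dummy_Account_Original": df["Dummy_Account"],
})

# Lookup: original Dummy_Account -> replacement Dummy_Account.
lookup = dict(zip(result["Dummy_Account_Original"], result["Account_Number_Map"]))
```

With this, lookup["ACC0000000169532"] gives "ACC0000000000001", while dummy accounts that were already the first for their Account_Number map to themselves.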
Timings
df = pd.concat([df] * 1000000, ignore_index=True)
# jezrael's solution
%%timeit
v = df.sort_values('Account_Number')
v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
v.sort_index()
315 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# in this post
%%timeit
m = df.drop_duplicates('Account_Number', keep='first')\
.set_index('Account_Number')\
.Dummy_Account
df.Account_Number.map(m)
163 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that performance will depend on your actual data.
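Timings aside, it may be worth checking that the two approaches agree on your data before choosing one. A rough sketch on the question's sample data follows. Passing kind="stable" to sort_values is an addition here, not part of the timed code: the sort-based approach relies on the first occurrence of each Account_Number staying first after sorting, and the default sort kind is not guaranteed to be stable.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Account_Number": [
        1050080713252, 1050223213427, 1050080713252, 1105113502309,
        1100043521537, 1100045301840, 1105113502309,
    ],
    "Dummy_Account": [
        "ACC0000000000001", "ACC0000000000002", "ACC0000000169532",
        "ACC0000000123005", "ACC0000000000004", "ACC0000000000005",
        "ACC0000000000040",
    ],
})

# Sort-based approach: blank out repeated accounts, then forward-fill.
# kind="stable" (an assumption added here) keeps ties in original row order.
v = df.sort_values("Account_Number", kind="stable")
v["Account_Number"] = v["Dummy_Account"].mask(v.duplicated("Account_Number")).ffill()
by_ffill = v.sort_index()["Account_Number"]

# Map-based approach from this answer.
m = (
    df.drop_duplicates("Account_Number", keep="first")
      .set_index("Account_Number")["Dummy_Account"]
)
by_map = df["Account_Number"].map(m)

# Both should yield the same replacement for every row.
same = by_ffill.tolist() == by_map.tolist()
```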