Replace other columns of duplicate rows with first unique value and create lookup

Problem description

Here is the data:

Account_Number  Dummy_Account
1050080713252   ACC0000000000001
1050223213427   ACC0000000000002
1050080713252   ACC0000000169532
1105113502309   ACC0000000123005
1100043521537   ACC0000000000004
1100045301840   ACC0000000000005
1105113502309   ACC0000000000040

Rows 1 and 3 have duplicate values in Account_Number, and so do rows 4 and 7. I need every duplicated Account_Number to map to the same Dummy_Account value, the one from its first occurrence. So for 1050080713252, rows 1 and 3 should both have the dummy value ACC0000000000001. But instead of replacing directly, I want to keep the original mapping.

My expected output is:

Account_Number_Map      Dummy_Account_Original
ACC0000000000001    ACC0000000000001
ACC0000000000002    ACC0000000000002
ACC0000000000001    ACC0000000169532
ACC0000000123005    ACC0000000123005
ACC0000000000004    ACC0000000000004
ACC0000000000005    ACC0000000000005
ACC0000000123005    ACC0000000000040

Since ACC0000000169532 is a duplicate Dummy_Account with respect to Account_Number, I want to create a lookup that replaces it with ACC0000000000001.

My attempt

I started with creating a dict like this -

maps = dict(zip(df.Dummy_Account, df.Account_Number))

I figured I would create a dict that has the original Dummy_Account values as keys and the new Dummy_Account values as values, but I am a little lost. My dataset is large, so I am also looking for an optimized solution.

Recommended answer

Using drop_duplicates, create a Series which you'll pass to map:

m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number = df.Account_Number.map(m)

df

     Account_Number     Dummy_Account
0  ACC0000000000001  ACC0000000000001
1  ACC0000000000002  ACC0000000000002
2  ACC0000000000001  ACC0000000169532
3  ACC0000000123005  ACC0000000123005
4  ACC0000000000004  ACC0000000000004
5  ACC0000000000005  ACC0000000000005
6  ACC0000000123005  ACC0000000000040
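
The question also asks to keep the original mapping and to build a lookup from each original Dummy_Account to its replacement. A minimal sketch of one way to do that on top of the same drop_duplicates/map idea follows. It assumes Account_Number is stored as strings and reuses the column names from the expected output; the names out and lookup are illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample data from the question; storing Account_Number as strings is an assumption.
df = pd.DataFrame({
    "Account_Number": ["1050080713252", "1050223213427", "1050080713252",
                       "1105113502309", "1100043521537", "1100045301840",
                       "1105113502309"],
    "Dummy_Account": ["ACC0000000000001", "ACC0000000000002", "ACC0000000169532",
                      "ACC0000000123005", "ACC0000000000004", "ACC0000000000005",
                      "ACC0000000000040"],
})

# First Dummy_Account seen for each Account_Number, indexed by Account_Number.
m = (df.drop_duplicates("Account_Number", keep="first")
       .set_index("Account_Number")["Dummy_Account"])

# Build the result in a new frame so that df itself is left untouched.
out = pd.DataFrame({
    "Account_Number_Map": df["Account_Number"].map(m),
    "Dummy_Account_Original": df["Dummy_Account"],
})

# Lookup: original Dummy_Account -> first Dummy_Account of the same Account_Number.
lookup = dict(zip(out["Dummy_Account_Original"], out["Account_Number_Map"]))
```

With this, lookup["ACC0000000169532"] gives "ACC0000000000001", and Dummy_Account values that were already the first of their group map to themselves.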

Timings

df = pd.concat([df] * 1000000, ignore_index=True)

# jezrael's solution

%%timeit
v = df.sort_values('Account_Number')
v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
v.sort_index()

315 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# in this post

%%timeit
m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number.map(m)

163 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the performance will depend on your actual data.
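
To check this on your own data outside IPython, the two approaches can be wrapped in functions and timed with the standard timeit module. This is only a sketch: the function names with_map and with_sort_mask are made up here, the frame is much smaller than the one timed above, and the sort uses kind="mergesort" (a stable sort, an addition not present in the timed snippet) so that both functions agree on which row counts as the first of each group.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to get a slightly larger frame.
df = pd.DataFrame({
    "Account_Number": ["1050080713252", "1050223213427", "1050080713252",
                       "1105113502309", "1100043521537", "1100045301840",
                       "1105113502309"],
    "Dummy_Account": ["ACC0000000000001", "ACC0000000000002", "ACC0000000169532",
                      "ACC0000000123005", "ACC0000000000004", "ACC0000000000005",
                      "ACC0000000000040"],
})
big = pd.concat([df] * 1000, ignore_index=True)


def with_map(frame):
    """drop_duplicates + map, as in this answer."""
    m = (frame.drop_duplicates("Account_Number", keep="first")
              .set_index("Account_Number")["Dummy_Account"])
    return frame["Account_Number"].map(m)


def with_sort_mask(frame):
    """sort + mask + ffill, with a stable sort so ties keep their original order."""
    v = frame.sort_values("Account_Number", kind="mergesort")
    v["Account_Number"] = v["Dummy_Account"].mask(v.duplicated("Account_Number")).ffill()
    return v.sort_index()["Account_Number"]


# Time each approach; absolute numbers depend on the machine and the data.
for fn in (with_map, with_sort_mask):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds / 5 * 1000:.2f} ms per call")
```

Swapping big for a sample of the real dataset gives a better picture than the repeated seven-row frame, since the share of duplicates affects both approaches.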
