Group by two column values and create a unique ID


Question

I am working on a dataset that looks very similar to the one below:

transaction_id   customer_id   phone           email
             1            19   12345   123@email.com          
             2            19   00001   245@gmail.com
             3         Guest   00001   123@email.com
             4            22   12345   123@email.com
             5            23   78900   678@gmail.com     

Customers 19, Guest and 22 are actually the same person, judging by the overlapping values in the phone and email columns.

Since the customer IDs for one customer are not unique, my goal is to find the linked rows and assign a new unique customer ID (in a new unique_id column).

transaction_id   customer_id   phone           email   unique_id
             1            19   12345   123@email.com           1
             2            19   00001   245@gmail.com           1
             3         Guest   00001   123@email.com           1
             4            22   12345   123@email.com           1
             5            23   78900   678@gmail.com           2

The complication is that I can group by email, or by email and phone, but neither catches all the rows. For example, transaction 2 always ends up with a different unique customer ID. I tried this code:

 df['unique_id'] = df.groupby('phone').grouper.group_info[0]

I greatly appreciate your time and help.

Recommended answer

This looks like a network problem, which can be solved with the help of networkx. We need to build a network of all customer_ids that are linked by either email or phone.
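The code that follows assumes a DataFrame named `df` already exists. As a minimal sketch, the sample data from the question could be built as below. Storing `customer_id` and `phone` as strings is an assumption made here, so that 'Guest' and the leading zeros in '00001' survive. The last lines show why grouping on a single column is not enough: each grouping only links rows that share that one value.

```python
import pandas as pd

# Sample data from the question (string dtypes are an assumption)
df = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'customer_id': ['19', '19', 'Guest', '22', '23'],
    'phone': ['12345', '00001', '00001', '12345', '78900'],
    'email': ['123@email.com', '245@gmail.com', '123@email.com',
              '123@email.com', '678@gmail.com'],
})

# Group numbers per phone and per e-mail, computed independently
by_phone = df.groupby('phone').ngroup()
by_email = df.groupby('email').ngroup()

print(by_phone.tolist())  # [1, 0, 0, 1, 2]
print(by_email.tolist())  # [0, 1, 0, 0, 2]
```

Transactions 1 and 2 share neither a phone nor an e-mail, so no single grouping puts them together. They are connected only indirectly, through transaction 3, and that indirect connection is what the graph captures.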

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Generate group numbers for unique phones and e-mails
df['p_gp'] = df.groupby('phone').ngroup()
df['e_gp'] = df.groupby('email').ngroup()

# This will create all pair-wise combinations of customer_ids within the same `gp`
def gen_pairs(df, gp_col):
    nwk = df[['customer_id', gp_col]].merge(df[['customer_id', gp_col]], on=gp_col).drop(columns=gp_col)
    # Removes duplicates, not necessary and slow
    #nwk = nwk.assign(dummy = nwk.apply(frozenset,1)).drop_duplicates('dummy').drop(columns='dummy')
    return nwk

# All pair-wise combinations of either e-mail or phone
dfn = pd.concat([gen_pairs(df, 'p_gp'), gen_pairs(df, 'e_gp')])

# Create the graph
G = nx.from_pandas_edgelist(dfn, source='customer_id_x', target='customer_id_y')

# Visualize which users are linked (plt.subplots returns the figure first, then the axes):
fig, ax = plt.subplots(figsize=(4,4))
nx.draw(G, ax=ax, node_size=30, with_labels=True, font_size=15, edge_color='#1f77b4')
plt.show()

We can then get the separate groups (the connected components of the graph) and build a dictionary that maps each customer_id to a unique ID.

# connected_component_subgraphs was removed in networkx 2.4; use connected_components
l = [sorted(c) for c in nx.connected_components(G)]
#[['19', '22', 'Guest'], ['23']]

d = {k: i for i, comp in enumerate(l) for k in comp}
#{'19': 0, '22': 0, 'Guest': 0, '23': 1}

# Finally map the customer_id with the dictionary
df['unique_id'] = df.customer_id.map(d)

  transaction_id customer_id  phone          email  p_gp  e_gp  unique_id
0              1          19  12345  123@email.com     1     0          0
1              2          19  00001  245@gmail.com     0     1          0
2              3       Guest  00001  123@email.com     0     0          0
3              4          22  12345  123@email.com     1     0          0
4              5          23  78900  678@gmail.com     2     2          1
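For readers who would rather not depend on networkx, the same connected-components idea can be sketched with a small union-find in plain Python. This is an alternative technique, not the method used in the answer above, and the function name `assign_unique_ids` is hypothetical. Each customer_id is linked to the phone and e-mail on its row, which avoids building all pair-wise combinations. IDs start at 1 to match the numbering asked for in the question.

```python
def assign_unique_ids(rows):
    """rows: iterable of (customer_id, phone, email) tuples.
    Returns one group number per row, starting at 1."""
    rows = list(rows)
    parent = {}

    def find(x):
        # Return the representative of x's set, registering x if new
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for cust, phone, email in rows:
        # Tag each key with its kind so a phone can never collide with an id
        union(('cust', cust), ('phone', phone))
        union(('cust', cust), ('email', email))

    ids = {}
    result = []
    for cust, _, _ in rows:
        root = find(('cust', cust))
        # First time a root is seen it gets the next number
        result.append(ids.setdefault(root, len(ids) + 1))
    return result


rows = [
    ('19', '12345', '123@email.com'),
    ('19', '00001', '245@gmail.com'),
    ('Guest', '00001', '123@email.com'),
    ('22', '12345', '123@email.com'),
    ('23', '78900', '678@gmail.com'),
]
print(assign_unique_ids(rows))  # [1, 1, 1, 1, 2]
```

With a DataFrame, the rows could be passed as `zip(df.customer_id, df.phone, df.email)` and the returned list assigned to `df['unique_id']`.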
