Pandas and python: deduplication of dataset by several fields


Question

I have a dataset of companies. Each company has a taxpayer number, an address, a phone number, and some other fields. Here is some Pandas code I took from Roméo Després:

import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})
print(df)

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
3      D      3       x
4      E      4       y
5      A      5       x
6      B      0       t
7      C      0       z
8      F      6       u
9      E      3       v

I need to deduplicate the dataset by these fields, meaning that non-unique companies may be linked by just one of these fields. That is, a company is definitely unique in my list only if it has no match on ANY of the key fields. If a company shares a taxpayer number with some other entity, and that entity shares an address with a third one, then all three records are the same company. The expected output in terms of unique companies should be:

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
8      F      6       u

The expected output, with the index of the unique company each duplicate maps to, should look like:

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     1
2      C      2       z                     2
3      D      3       x                     0
4      E      4       y                     1
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     3

How can I filter out the duplicates in this case using python/pandas?

The only algorithm that comes to mind is the following direct approach:

  1. I group the dataset by the first key, collecting the other keys as sets in the resulting dataset.
  2. Then I iteratively walk those sets with the second key, adding to my grouped dataset any new second-key values found for a given first key, and repeat until nothing changes.
  3. Once there is nothing left to add, I repeat the process for the third key.

This doesn't look very promising in terms of performance or simplicity of coding.
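The transitive linking those steps aim at amounts to computing connected components over the rows. As a compact illustration, here is a hand-rolled union-find sketch with no dependencies beyond pandas itself (the `find`/`union` helpers are my own names, not part of any library):

```python
import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

# parent[i] is the current representative guess for row i (union-find forest).
parent = list(df.index)

def find(i):
    # Walk up to the root, compressing the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    ri, rj = find(i), find(j)
    if ri != rj:
        # Always keep the smaller index as the root.
        parent[max(ri, rj)] = min(ri, rj)

# Link every pair of rows that share a value in any column.
for col in df.columns:
    for nodes in df.groupby(col).indices.values():
        for other in nodes[1:]:
            union(nodes[0], other)

df["representative_index"] = [find(i) for i in df.index]
print(df)
```

Because each union keeps the smaller root as parent, the final root of every component is its smallest row index, which serves directly as the representative.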

Are there other ways to remove duplicates that match on any one of several keys?

Answer

You could solve this using the graph analysis library networkx.

import itertools

import networkx as nx
import pandas as pd


df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

def iter_edges(df):
    """Yield an edge for every pair of rows sharing a value in any column."""
    for name in df.columns:
        # Rows with the same value in this column belong to the same company.
        for nodes in df.groupby(name).indices.values():
            yield from itertools.combinations(nodes, 2)

def iter_representatives(graph):
    """Yield all elements and their representative."""
    for component in nx.connected_components(graph):
        representative = min(component)
        for element in component:
            yield element, representative


graph = nx.Graph()
graph.add_nodes_from(df.index)
graph.add_edges_from(iter_edges(df))

df["representative_index"] = pd.Series(dict(iter_representatives(graph)))

In the end, df looks like:

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     0
2      C      2       z                     0
3      D      3       x                     0
4      E      4       y                     0
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     0

Note that you can call df.drop_duplicates("representative_index") to obtain the unique rows:

  tax_id  phone address  representative_index
0      A      0       x                     0
8      F      6       u                     8
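If you would rather not depend on networkx, the same connected-components computation is available in scipy's sparse graph routines. A sketch, assuming scipy is installed (it rebuilds the example DataFrame so the snippet is self-contained):

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

n = len(df)
row_idx, col_idx = [], []
for col in df.columns:
    for nodes in df.groupby(col).indices.values():
        # A star through the first row of each group is enough for
        # connectivity and needs fewer edges than a full clique.
        row_idx.extend([nodes[0]] * (len(nodes) - 1))
        col_idx.extend(nodes[1:])

adj = coo_matrix((np.ones(len(row_idx)), (row_idx, col_idx)), shape=(n, n))
_, labels = connected_components(adj, directed=False)

# Map each component label to the smallest row index in that component.
df["representative_index"] = pd.Series(df.index).groupby(labels).transform("min")
print(df)
```

The component labels themselves are arbitrary, so the final groupby/transform picks the minimum row index per component to match the representative convention used above.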

