What is the most efficient way to dedupe a Pandas dataframe that has typos?


Problem Description

I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might have typos, even though they are still duplicates. For example, suppose I had this dataframe:

  index  name          zipcode
-------  ----------  ---------
      0  john doe        12345
      1  jane smith      54321
      2  john dooe       12345
      3  jane smtih      54321

The typos could occur in either name or zipcode, but let's just worry about the name one for this question. Obviously 0 and 2 are duplicates as are 1 and 3. But what is the computationally most efficient way to figure this out?
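
For concreteness, here is a minimal sketch of building this toy frame in pandas (the data is taken straight from the table above):

import pandas as pd

df = pd.DataFrame({
    'name': ['john doe', 'jane smith', 'john dooe', 'jane smtih'],
    'zipcode': ['12345', '54321', '12345', '54321'],
})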

I have been using the Levenshtein distance (via the fuzzywuzzy package) to calculate the distance between two strings, which works great when the dataframe is small and I can iterate through it via:

from fuzzywuzzy import fuzz

for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        ratio = fuzz.partial_ratio(row['name'], row2['name'])

        if ratio > 90:  # A good threshold for single character typos on names
            # Do something to declare a match and throw out the duplicate
            pass

Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long. And obviously this gets worse if I also need to dedupe potential typos in the zipcode. Yes, I could do this with .itertuples(), which would give me a factor of ~100 speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?
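
For what it's worth, a rough sketch of the .itertuples() variant mentioned above (same O(n^2) pairwise logic, just avoiding the per-row Series construction; it assumes the default integer index, and the match handling is left as a placeholder):

from fuzzywuzzy import fuzz

matches = []
for row in df.itertuples():
    for row2 in df.itertuples():
        # Compare each unordered pair exactly once
        if row.Index < row2.Index and fuzz.partial_ratio(row.name, row2.name) > 90:
            matches.append((row.Index, row2.Index))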

Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning, which I don't have, nor am I under the impression that the package handles unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not have to go that far if there is an existing, better approach.

Recommended Answer

The package pandas-dedupe can help you with your task.

pandas-dedupe works as follows: first it asks you to label a handful of the records it is most confused about. Afterwards, it uses this knowledge to resolve duplicate entities. And that is it :)

You can try the following:

import pandas as pd
from pandas_dedupe import dedupe_dataframe

# Toy data: 'john'/'jon' share a zip, so they should land in the same cluster
df = pd.DataFrame.from_dict({'name': ['john', 'mark', 'frank', 'jon', 'john'], 'zip': ['11', '22', '33', '11', '11']})

# Interactively train on the 'name' and 'zip' fields;
# canonicalize=True adds a canonical value per cluster
dd = dedupe_dataframe(df, ['name', 'zip'], canonicalize=True, sample_size=1)

The console will then ask you to label some examples. Press 'y' if the pair is a duplicate, otherwise 'n', and once done press 'f' for finished. It will then perform deduplication on the entire dataframe.
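
As a possible follow-up, assuming the returned frame contains a 'cluster id' column (which recent pandas-dedupe versions emit alongside a 'confidence' score; check your installed version's output), you could collapse each cluster to a single representative row:

# 'cluster id' is assumed from pandas-dedupe's output format;
# verify the column name against your installed version.
deduped = dd.groupby('cluster id').first().reset_index()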
