How to Normalize Names

Question

I am using pandas DataFrames, and my data lists customer counts per company. However, the company names vary slightly from row to row, which splits the counts for one company across several entries. Example:

Company          Customers
AAAB             1,000
AAAB Inc.        900
The AAAB Inc.    20
AAAB the INC     10

I want to get the total customers per company from a database of several different companies whose names are written in non-standard ways. Any idea where I should start?

Recommended answer

I remember reading a blog post about the fuzzywuzzy library (while looking into another question), which can do this:

pip install fuzzywuzzy

You can use its partial_ratio function to "fuzzy match" the strings:

In [11]: from fuzzywuzzy.fuzz import partial_ratio

In [12]: partial_ratio('AAAB', 'the AAAB inc.')
Out[12]: 100

It scores this as a perfect match, while unrelated names score much lower:

In [13]: partial_ratio('AAAB', 'AAPL')
Out[13]: 50

In [14]: partial_ratio('AAAB', 'Google')
Out[14]: 0

We can then take the best match from the list of actual company names (assuming you have one):

In [15]: co_list = ['AAAB', 'AAPL', 'GOOG']

In [16]: df.Company.apply(lambda mistyped_co: max(co_list, 
                                                  key=lambda co: partial_ratio(mistyped_co, co)))
Out[16]: 
0    AAAB
1    AAAB
2    AAAB
3    AAAB
Name: Company, dtype: object
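
To get from the matched names to the totals the question asks for, the counts can be summed per canonical name. The following is a minimal sketch using only the standard library: partial_score is a hypothetical stand-in built on difflib.SequenceMatcher, not fuzzywuzzy's partial_ratio, and it lower-cases both strings first (an assumption). The rows are the example table from the question.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def partial_score(a, b):
    """Best similarity (0-100) between the shorter string and any
    equally long window of the longer string, ignoring case.
    Hypothetical helper; a stand-in for a partial fuzzy match."""
    shorter, longer = sorted((a.lower(), b.lower()), key=len)
    if not shorter:
        return 0
    windows = (
        longer[i:i + len(shorter)]
        for i in range(len(longer) - len(shorter) + 1)
    )
    best = max(SequenceMatcher(None, shorter, w).ratio() for w in windows)
    return round(100 * best)


# Example rows from the question: (company name as typed, customer count).
rows = [
    ("AAAB", "1,000"),
    ("AAAB Inc.", "900"),
    ("The AAAB Inc.", "20"),
    ("AAAB the INC", "10"),
]
co_list = ["AAAB", "AAPL", "GOOG"]

totals = defaultdict(int)
for name, customers in rows:
    # Pick the canonical name with the highest score for this row.
    canonical = max(co_list, key=lambda co: partial_score(name, co))
    # Counts such as "1,000" contain thousands separators; strip them.
    totals[canonical] += int(customers.replace(",", ""))

totals = dict(totals)
print(totals)
```

With a DataFrame, the same idea would be to convert Customers to integers and group by the matched names from the previous step, for example by passing that Series to df.groupby and summing the Customers column.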

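I strongly suspect there is something in the scikit-learn or numpy libraries that does this more efficiently on large datasets... but this should get the job done.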

If you don't have the company list, you'll probably have to do something cleverer...
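
One possible starting point without a company list, offered only as a sketch and not as the approach the answer had in mind: reduce each name to a normalized key by lower-casing it, dropping punctuation, and removing filler words such as "the" and "inc", then group rows that share a key. The NOISE_WORDS set and the normalize helper are assumptions made for illustration and would need tuning for real data.

```python
import re
from collections import defaultdict

# Hypothetical list of tokens that carry no identifying information.
NOISE_WORDS = {"the", "inc", "incorporated", "corp", "co", "ltd", "llc"}


def normalize(name):
    """Return a crude canonical key for a company name."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in NOISE_WORDS]
    # Fall back to all tokens if the name consists only of noise words.
    return " ".join(kept or tokens).upper()


names = ["AAAB", "AAAB Inc.", "The AAAB Inc.", "AAAB the INC"]

# Map each normalized key to the original spellings that produced it.
groups = defaultdict(list)
for name in names:
    groups[normalize(name)].append(name)

print(dict(groups))
```

This only catches variants that differ by case, punctuation, or filler words. Misspellings would still need a fuzzy comparison between the resulting keys.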
