How to Normalize Names

Views: 119  Published: 2020/5/18 0:33:44  python pandas nlp normalize

Question

I am using pandas DataFrames, and I have data with customer counts per company. However, the company names vary slightly, which ends up affecting the totals. Example:

Company        Customers
AAAB           1,000
AAAB Inc.      900
The AAAB Inc.  20
AAAB the INC   10

I want to get the total number of customers from a database of several different companies whose names are not standardized. Any idea where I should start?

Recommended answer

I remember reading a blog post about the fuzzywuzzy library (while looking into another question), which can do this:

pip install fuzzywuzzy

You can use its partial_ratio function to "fuzzy match" the strings:

In [11]: from fuzzywuzzy.fuzz import partial_ratio

In [12]: partial_ratio('AAAB', 'the AAAB inc.')
Out[12]: 100

It seems confident that this is a good match!

In [13]: partial_ratio('AAAB', 'AAPL')
Out[13]: 50

In [14]: partial_ratio('AAAB', 'Google')
Out[14]: 0
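To give a feel for where scores like these come from, here is a rough sketch of the idea behind a partial match: slide the shorter string across the longer one and keep the best window score. It uses the standard library's difflib rather than fuzzywuzzy, and simple_partial_ratio is a hypothetical name, so treat it as an illustration of the concept and not as the library's actual implementation.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Score 0-100: best match of the shorter string against any
    equal-length window of the longer one (simplified illustration)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    # Compare the shorter string with every window of the same length.
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )
    return round(100 * best)


print(simple_partial_ratio('AAAB', 'the AAAB inc.'))  # 100
print(simple_partial_ratio('AAAB', 'AAPL'))           # 50
print(simple_partial_ratio('AAAB', 'Google'))         # 0
```

Because 'AAAB' appears verbatim inside 'the AAAB inc.', one window matches exactly, which is why extra words around the name do not lower the score.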

We can take the best match from the actual company list (assuming you have one):

In [15]: co_list = ['AAAB', 'AAPL', 'GOOG']

In [16]: df.Company.apply(lambda mistyped_co: max(co_list, 
                                                  key=lambda co: partial_ratio(mistyped_co, co)))
Out[16]: 
0    AAAB
1    AAAB
2    AAAB
3    AAAB
Name: Company, dtype: object
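Once each row is mapped to a canonical name, the totals the question asks for are a groupby away. The sketch below rebuilds the example data and assumes the Customers column holds strings with thousands separators, as shown in the question. To stay runnable without extra installs it uses difflib.SequenceMatcher as a stand-in scorer; the names similarity, best_match and Canonical are made up for illustration, and partial_ratio could be swapped in as the scoring function.

```python
from difflib import SequenceMatcher

import pandas as pd

# Example data from the question; Customers assumed to be strings like '1,000'.
df = pd.DataFrame({
    'Company': ['AAAB', 'AAAB Inc.', 'The AAAB Inc.', 'AAAB the INC'],
    'Customers': ['1,000', '900', '20', '10'],
})
co_list = ['AAAB', 'AAPL', 'GOOG']


def similarity(a: str, b: str) -> float:
    # Stand-in scorer (assumption); fuzzywuzzy's partial_ratio could be used instead.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name: str) -> str:
    # Pick the known company name with the highest similarity score.
    return max(co_list, key=lambda co: similarity(name, co))


df['Canonical'] = df['Company'].apply(best_match)
# Strip thousands separators before converting to integers.
df['Customers'] = df['Customers'].str.replace(',', '', regex=False).astype(int)

totals = df.groupby('Canonical')['Customers'].sum()
print(totals)
```

Note that max() always returns some entry from co_list, so a name with no good match is still assigned to its highest-scoring company; a minimum-score cutoff may be worth adding for real data.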

I strongly suspect there is something in scikit-learn or numpy that does this more efficiently on large datasets... but this should get the job done.

If you don't have the company list, you'll probably have to do something cleverer...
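One possible starting point without a company list, offered as a sketch and not as what the answer had in mind, is to strip case, punctuation and common filler words so that variants collapse to the same key. The NOISE_WORDS set and the normalize function are hypothetical and would need tuning for real data.

```python
import re
from collections import defaultdict

# Hypothetical filler words to drop; extend for your own data.
NOISE_WORDS = {'the', 'inc', 'llc', 'ltd', 'corp', 'co'}


def normalize(name: str) -> str:
    # Lowercase, split on anything that is not a letter or digit,
    # then drop filler words.
    words = re.findall(r'[a-z0-9]+', name.lower())
    kept = [w for w in words if w not in NOISE_WORDS]
    return ' '.join(kept)


rows = [('AAAB', 1000), ('AAAB Inc.', 900), ('The AAAB Inc.', 20), ('AAAB the INC', 10)]
totals = defaultdict(int)
for name, customers in rows:
    totals[normalize(name)] += customers

print(dict(totals))  # {'aaab': 1930}
```

This only catches variants that differ by filler words and punctuation; misspellings would still need fuzzy matching between the normalized keys.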
