How to Normalize Names
Question
I am using pandas DataFrames, and my data lists the number of customers per company. However, the company names vary slightly from row to row, which distorts any per-company totals. Example:
Company          Customers
AAAB             1,000
AAAB Inc.        900
The AAAB Inc.    20
AAAB the INC     10
I want to get the total customers per company from a database covering several different companies whose names are not standardized. Any idea where I should start?
Recommended answer
I remember reading a blog post about the fuzzywuzzy library (while looking into another question), which can do this. Install it with:
pip install fuzzywuzzy
You can use its partial_ratio function to "fuzzy match" the strings:
In [11]: from fuzzywuzzy.fuzz import partial_ratio
In [12]: partial_ratio('AAAB', 'the AAAB inc.')
Out[12]: 100
A score of 100 means it is confident this is a good match. Unrelated names score much lower:
In [13]: partial_ratio('AAAB', 'AAPL')
Out[13]: 50
In [14]: partial_ratio('AAAB', 'Google')
Out[14]: 0
If you have a list of the actual company names, you can map each name in the data to its best match from that list:
In [15]: co_list = ['AAAB', 'AAPL', 'GOOG']
In [16]: df.Company.apply(lambda mistyped_co: max(co_list,
key=lambda co: partial_ratio(mistyped_co, co)))
Out[16]:
0 AAAB
1 AAAB
2 AAAB
3 AAAB
Name: Company, dtype: object
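Once every row is mapped to a canonical name, the totals the question asks for are a groupby away. The following is a minimal sketch, not the answer's exact method: it assumes the DataFrame looks like the table in the question (with Customers stored as strings such as "1,000"), and it uses difflib from the standard library as a stand-in scorer so that it runs without extra installs. The names similarity, best_match, and the Canonical column are hypothetical; fuzzywuzzy's partial_ratio could be passed as the scorer instead.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Stand-in scorer (0-100); fuzzywuzzy's partial_ratio could be used instead."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(name, candidates, scorer=similarity):
    """Return the candidate that scores highest against name."""
    return max(candidates, key=lambda co: scorer(name, co))


# Assumed shape of the data, copied from the question's example table.
df = pd.DataFrame({
    "Company": ["AAAB", "AAAB Inc.", "The AAAB Inc.", "AAAB the INC"],
    "Customers": ["1,000", "900", "20", "10"],
})
co_list = ["AAAB", "AAPL", "GOOG"]

# Turn "1,000" into 1000 so the column can be summed.
df["Customers"] = df["Customers"].str.replace(",", "", regex=False).astype(int)

# Map each raw name to its best match, then total customers per canonical name.
df["Canonical"] = df["Company"].apply(lambda name: best_match(name, co_list))
totals = df.groupby("Canonical")["Customers"].sum()
print(totals)
```

Taking the maximum always returns some company, even for a name that matches nothing well, so in practice it may be worth rejecting matches below a score threshold and reviewing those rows by hand.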
I strongly suspect that scikit-learn or numpy has something that does this more efficiently on large datasets, but this should get the job done.
If you don't have the company list, you'll probably have to do something cleverer...
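One possible starting point when there is no reference list, offered only as a rough sketch: reduce each name to a normalized key by lowercasing it, dropping punctuation, and removing filler words, then total customers per key. The FILLER set and the normalize function below are hypothetical and would need tuning for real data; this uses only the standard library and is not something the answer itself proposes.

```python
import re
from collections import defaultdict

# Hypothetical filler words; extend this set for the data at hand.
FILLER = {"the", "inc", "llc", "ltd", "corp", "co"}


def normalize(name):
    """Lowercase, drop punctuation, and remove filler tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in FILLER]
    return " ".join(kept)


# Rows from the question's example, with customer counts already parsed.
rows = [
    ("AAAB", 1000),
    ("AAAB Inc.", 900),
    ("The AAAB Inc.", 20),
    ("AAAB the INC", 10),
]

# Sum customers under each normalized key.
totals = defaultdict(int)
for company, customers in rows:
    totals[normalize(company)] += customers

print(dict(totals))
```

This only merges names that differ by case, punctuation, or filler words. Misspellings would still end up under separate keys, which is where a fuzzy scorer such as partial_ratio could be applied to the normalized keys.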