How do I fuzzy match items in a column of an array in Python?
Question
I have an array of team names from NCAA, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of the name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to be able to take all variations of a team name by fuzzy matching them and rename all variants to one name. I'm working in python 2.7 and I have a numpy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before.
I have considered fuzzy matching through a for-loop, which would (despite being unbelievably slow) compare each element in the column of the array to every other element, but I'm not really sure how to build it.
Currently, my array looks like this:
{Names, info1, info2, info3}
The array is a few thousand rows long, so I'm trying to make the program as efficient as possible.
Recommended answer
The Levenshtein edit distance is one of the most common ways to perform fuzzy matching of strings. It is available in the python-Levenshtein package. Another popular measure is the Jaro-Winkler distance, which is available in the same package.
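To make concrete what the Levenshtein distance measures, here is a minimal pure-Python sketch of the usual dynamic-programming computation. The function name levenshtein_distance is illustrative and is not part of python-Levenshtein; in practice the package's own lv.distance is the better choice because it is implemented as a C extension.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance('str', 'stum'))    # 2
print(levenshtein_distance('str', 'string'))  # 3
```

The comparison is case-sensitive, so lower-casing the names first is usually worthwhile when matching team names.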
Assume a simple NumPy array:
import numpy as np
import Levenshtein as lv
ar = np.array([
    'string',
    'stum',
    'Such',
    'Say',
    'nay',
    'powder',
    'hiden',
    'parrot',
    'ming',
])
We define helpers that take a threshold and a string and return a boolean mask over the array, marking the entries whose Levenshtein distance or Jaro-Winkler score against that string falls below the threshold.
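def levenshtein(dist, string):
    # Boolean mask: True where the edit distance to `string` is below `dist`
    return np.array([lv.distance(string, x) < dist for x in ar])

def jaro(dist, string):
    # Boolean mask: True where the Jaro-Winkler score is below `dist`
    return np.array([lv.jaro_winkler(string, x) < dist for x in ar])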
Note that the Levenshtein distance is an integer counting character edits, whereas lv.jaro_winkler returns a floating-point value between 0 and 1 that is a similarity: 1 means the strings are identical and 0 means they have nothing in common. Because the jaro helper uses a `< dist` comparison, it selects the strings whose similarity is below the threshold, that is, the less similar ones. Let's test this using np.where:
print(ar[np.where(levenshtein(3, 'str'))])
print(ar[np.where(levenshtein(5, 'str'))])
print(ar[np.where(jaro(0.00000001, 'str'))])
print(ar[np.where(jaro(0.9, 'str'))])
We then get:
['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
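To connect this back to the question, the following is a rough sketch of renaming every variant of a team name to a single canonical name. It makes several assumptions that are not in the original answer: a list of canonical names is known up front, the helper build_mapping and the 0.6 cutoff are purely illustrative, and the similarity comes from difflib in the standard library rather than from python-Levenshtein. The mapping is computed once per unique name, which avoids comparing every row against every other row.

```python
import difflib


def build_mapping(names, canonical, cutoff=0.6):
    """Map each distinct name to its closest canonical name.

    Names with no canonical candidate scoring at least `cutoff`
    are mapped to themselves, i.e. left unchanged.
    """
    mapping = {}
    for name in set(names):  # each distinct spelling is matched only once
        matches = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        mapping[name] = matches[0] if matches else name
    return mapping


# Hypothetical data: the canonical list is assumed, not derived.
canonical = ["Alabama Crimson Tide", "Auburn Tigers"]
names = ["Crimson Tide", "Alabama Crimson Tide", "Auburn Tigers",
         "Tigers", "Duke Blue Devils"]

mapping = build_mapping(names, canonical)
renamed = [mapping[name] for name in names]
print(renamed)
```

With a NumPy array, the same mapping could be applied to the name column only, leaving the statistics columns untouched. The cutoff needs tuning on real data: too low and unrelated teams are merged, too high and short variants are missed.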