How do I fuzzy match items in a column of an array in Python?

Question

I have an array of team names from the NCAA, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of a name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to collect all variations of a team name by fuzzy matching them and rename every variant to one name. I'm working in Python 2.7 and I have a numpy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before.

I have considered fuzzy matching through a for-loop, which would (despite being unbelievably slow) compare each element in the column of the array to every other element, but I'm not really sure how to build it.

Currently, my array looks like this:

{Names, info1, info2, info3}

The array is a few thousand rows long, so I'm trying to make the program as efficient as possible.

Recommended answer

The Levenshtein edit distance is the most common way to perform fuzzy matching of strings. It is available in the python-Levenshtein package. Another popular measure is the Jaro-Winkler distance, also available in the same package.
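
To make the idea of an edit distance concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the usual dynamic-programming recurrence. The function name `edit_distance` is made up for this illustration and is not part of python-Levenshtein; in practice the package's `lv.distance` is the better choice, since it is a compiled extension.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (illustrative sketch)."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or keep it)
            ))
        previous = current
    return previous[-1]
```

For example, `edit_distance('str', 'stum')` is 2 (substitute `r` with `u`, then append `m`), while `edit_distance('str', 'string')` is 3 (three appended characters).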

Assuming a simple numpy array:

import numpy as np
import Levenshtein as lv

ar = np.array([
    'string',
    'stum',
    'Such',
    'Say',
    'nay',
    'powder',
    'hiden',
    'parrot',
    'ming',
])

We define helpers that compare a given string with every string in the array and return a boolean mask marking the entries whose Levenshtein distance, or Jaro-Winkler score, falls below a threshold.

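def levenshtein(dist, string):
    # Boolean mask: True where the edit distance to `string` is below `dist`.
    return [lv.distance(string, x) < dist for x in ar]

def jaro(dist, string):
    # lv.jaro_winkler returns a similarity in [0, 1] (1.0 = identical), so this
    # mask is True for entries that are LESS similar to `string` than `dist`.
    return [lv.jaro_winkler(string, x) < dist for x in ar]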

Now, note that the Levenshtein distance is an integer count of character edits, whilst lv.jaro_winkler returns a floating point value between 0 and 1. That value is a similarity score (1 means identical), so the x<dist test in the jaro helper keeps the strings that are less similar than the threshold. Let's test this using np.where:

# Parenthesised single-argument print works in Python 2.7 and Python 3 alike.
print(ar[np.where(levenshtein(3, 'str'))])
print(ar[np.where(levenshtein(5, 'str'))])
print(ar[np.where(jaro(0.00000001, 'str'))])
print(ar[np.where(jaro(0.9, 'str'))])

Then we get:

['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
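
To connect this back to the original goal of renaming every variant of a team name to one name, here is one possible sketch. It is not the answer's method: it swaps in `difflib.SequenceMatcher` from the standard library so that it runs without extra packages, and the helper names `similarity` and `build_canonical_map`, the 0.7 threshold, and the "Duke Blue Devils" sample rows are all assumptions made up for illustration. Each distinct name is compared only against the group representatives chosen so far, which avoids comparing every row with every other row.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the lowercased strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_canonical_map(names, threshold=0.7):
    """Map every distinct name to a representative spelling (hypothetical helper)."""
    canonical = []  # names chosen as group representatives so far
    mapping = {}
    # Visit the longest names first so the fullest variant becomes the representative.
    for name in sorted(set(names), key=lambda n: (-len(n), n)):
        best, best_score = None, 0.0
        for candidate in canonical:
            score = similarity(name, candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            mapping[name] = best       # close enough: reuse the representative
        else:
            canonical.append(name)     # nothing similar yet: start a new group
            mapping[name] = name
    return mapping


# Made-up sample column; in the question this would be the names column.
names = ["Crimson Tide", "Alabama Crimson Tide", "Blue Devils",
         "Duke Blue Devils", "Crimson Tide"]
mapping = build_canonical_map(names)
renamed = [mapping[n] for n in names]
```

With this sample data, "Crimson Tide" is mapped to "Alabama Crimson Tide" and "Blue Devils" to "Duke Blue Devils". The threshold needs tuning on real data: too low and unrelated teams merge, too high and short variants are left as their own group.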
