How do I fuzzy match items in a column of an array in Python?

Views: 1200 Posted: 2020/6/15 19:29:07 Tags: python-2.7 fuzzy-comparison

Problem description

I have an array of NCAA team names, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of a name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to find all variations of a team name by fuzzy matching them and rename all variants to one name. I'm working in Python 2.7 and I have a NumPy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before.

I have considered fuzzy matching through a for-loop, which would (despite being unbelievably slow) compare each element in the column of the array to every other element, but I'm not really sure how to build it.

Currently, my array looks like this:

{Names, info1, info2, info3}

The array is a few thousand rows long, so I'm trying to make the program as efficient as possible.

Recommended answer

The Levenshtein edit distance is one of the most common ways to fuzzy match strings. It is available in the python-Levenshtein package. Another popular measure is the Jaro-Winkler score, which is available in the same package.
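To make concrete what the edit distance measures, here is a minimal pure-Python sketch of the usual dynamic-programming formulation. The function name edit_distance is made up for illustration; it is not the python-Levenshtein implementation, which is written in C and much faster.

```python
def edit_distance(a, b):
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance from the empty prefix of a to b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]
```

For example, edit_distance('str', 'stum') returns 2: substitute r with u, then append m. The comparison is case-sensitive, so lowercasing names before comparing them is usually worthwhile.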

Assume a simple NumPy array:

import numpy as np
import Levenshtein as lv

ar = np.array([
    'string',
    'stum',
    'Such',
    'Say',
    'nay',
    'powder',
    'hiden',
    'parrot',
    'ming',
])

We define helpers that compare a given string with every string in the array and return a boolean mask marking the entries whose Levenshtein distance or Jaro-Winkler score falls below a threshold.

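def levenshtein(dist, string):
    # Boolean mask: True where the edit distance to `string` is below `dist`.
    return np.array([lv.distance(string, x) < dist for x in ar])

def jaro(dist, string):
    # Boolean mask: True where the Jaro-Winkler score is below `dist`.
    return np.array([lv.jaro_winkler(string, x) < dist for x in ar])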

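Now, note that the Levenshtein distance is an integer counting single-character edits, whilst lv.jaro_winkler returns a floating point value between 0 and 1. That value is a similarity score (1 means the strings are identical), so the x < dist test in jaro selects strings that are less similar than the threshold, not more. Let's test this using np.where: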

print(ar[np.where(levenshtein(3, 'str'))])
print(ar[np.where(levenshtein(5, 'str'))])
print(ar[np.where(jaro(0.00000001, 'str'))])
print(ar[np.where(jaro(0.9, 'str'))])

We then get:

['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
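
To get from matching to the renaming the question asks about, one possible approach is sketched below. It rests on several assumptions: the CANONICAL list and the sample names are invented for illustration, a plain list stands in for the name column of the array, and difflib from the standard library is swapped in for python-Levenshtein so the sketch runs without extra packages.

```python
import difflib

# Hypothetical canonical names, not taken from the question's data.
CANONICAL = ['Alabama Crimson Tide', 'Auburn Tigers', 'Florida Gators']

def canonicalize(name, choices=CANONICAL, cutoff=0.5):
    """Return the closest canonical name, or the input unchanged
    when nothing scores at least `cutoff`."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Stand-in for the name column of the array.
names = ['Crimson Tide', 'Alabama Crimson Tide', 'Auburn',
         'Gators', 'Florida Gators']

# Match each distinct name once, then rename every row by lookup.
mapping = {n: canonicalize(n) for n in set(names)}
renamed = [mapping[n] for n in names]
```

Matching only the distinct names keeps the number of comparisons small even when the array has a few thousand rows. The cutoff of 0.5 is a guess and would need tuning on the real data, since too low a value merges different teams that happen to share a word.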
