Speeding up a "closest" string match algorithm


Problem Description

I am currently processing a very large database of locations and trying to match them with their real world coordinates.

To achieve this, I have downloaded the geoname dataset, which contains a lot of entries. It gives possible names and lat/long coordinates. To try and speed up the process, I managed to reduce the huge csv file (of 1.6 GB) to 0.450 GB by removing entries that do not make sense for my dataset; however, it still contains 4 million entries.
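
The cleaning step itself isn't shown in the question; as a rough illustration, a streaming filter along these lines can shrink such a file without loading it all into memory. The keep() predicate and the column layout are assumptions, not the actual cleaning rules used:

import csv

def keep(row):
    # Hypothetical predicate: drop rows that make no sense for the
    # dataset, e.g. rows with an empty name field.
    return bool(row and row[0].strip())

with open('geonames.csv', newline='', encoding='utf-8') as src, \
     open('geoname_cleaned.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.writer(dst)
    writer.writerows(row for row in csv.reader(src) if keep(row))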

Now I have many entries such as:

  1. Slettmark peaks seen from my camp in Jotunheimen, Norway, last week
  2. Adventuring in the Fairy Glen, Isle of Skye, Scotland, UK
  3. Morning in the Emigrant Wilderness, California

Knowing that string matching with such long strings is unreliable, I used Stanford's NER via NLTK to get a better string to qualify my location (a minimal sketch follows the list below). Now I have strings like:

  1. Slettmarkmountains Jotunheimen Norway
  2. Fairy Glen Skye Scotland UK
  3. Emigrant Wilderness California
  4. Yosemite National Park
  5. Half Dome Yosemite National Park
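
The NER step isn't shown in the question either; a minimal sketch of extracting location tokens with NLTK's StanfordNERTagger might look like the following. The model and jar paths are placeholders for a local download of the Stanford NER distribution:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Placeholder paths: point these at your Stanford NER download.
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner.jar')

sentence = 'Morning in the Emigrant Wilderness, California'
tagged = st.tag(word_tokenize(sentence))
# Keep only the tokens tagged as LOCATION, which here would give
# something like 'Emigrant Wilderness California'.
location = ' '.join(tok for tok, tag in tagged if tag == 'LOCATION')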

The geoname dataset contains things like:

  1. Jotunheimen Norway lat long
  2. Slettmarkmountains Jotunheimen Norway lat long
  3. Bryce Canyon lat long
  4. Half Dome lat long
  5. ...

And I am applying this algorithm to get a good possible match between my entries and the geoname csv containing 4M entries. I first read the geoname_cleaned.csv file and put all of the data into a list; for each of my entries, I then call string_similarity() between the current entry and every entry of geoname_list (a sketch of this driver code follows the functions below):

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union
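
The surrounding driver code isn't included in the question; a minimal sketch of the loading and matching loop described above might look like this, assuming the place name sits in the first column of geoname_cleaned.csv:

import csv

# Load the cleaned geoname file into a list of name strings.
with open('geoname_cleaned.csv', newline='', encoding='utf-8') as f:
    geoname_list = [row[0] for row in csv.reader(f)]

def closest_geoname(entry):
    """Return the geoname with the highest similarity to entry."""
    return max(geoname_list, key=lambda name: string_similarity(entry, name))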

I have tested the algorithm on a subset of my original dataset and it works fine, but it is obviously terribly slow (up to 40 seconds for a single location). Since I have more than a million entries to process, this would take a good 10000 hours or more. I was wondering if you had any ideas on how to speed this up. I obviously thought of parallel processing (a rough sketch follows), but I don't have any HPC solution available. Perhaps simple ideas could help me speed this up.
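
For reference, parallel processing needs only the standard library, no HPC cluster. In this sketch, closest_geoname is the hypothetical per-entry matcher sketched above and my_entries stands for the list of NER strings:

from multiprocessing import Pool

# Each worker process scans the full geoname_list independently,
# so memory is duplicated per process, but no cluster is needed.
if __name__ == '__main__':
    with Pool() as pool:
        matches = pool.map(closest_geoname, my_entries)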

I'm open to any and every idea you might have, but would prefer a Python-compatible solution.

Thanks in advance :).

I have tried fuzzywuzzy with fuzz.token_set_ratio(s1, s2) and it performs worse: the matches are not as good as with my custom technique, and the running time increases by a good 15 seconds for a single entry.

I also thought of using some kind of sorting at the beginning to help with the matching, but my naive implementation would not work. I'm sure there are ways to speed this up though, perhaps by getting rid of some entries in the geoname dataset or by sorting them in some way. I have already done a lot of cleaning to remove useless entries, but can't get the number much below 4M.

Answer

We can speed up the matching in a couple of ways. I assume that in your code str1 is a name from your dataset and str2 is a geoname string. To test the code I made two tiny data sets from the data in your question. And I wrote two matching functions best_match and first_match that use your current string_similarity function so we can see that my strategy gives the same results. best_match checks all geoname strings & returns the string with the highest score if it exceeds a given threshold score, otherwise it returns None. first_match is (potentially) faster: it just returns the first geoname string that exceeds the threshold, or None if it can't find one, so if it doesn't find a match then it still has to search the entire geoname list.

In my improved version, we generate the bigrams for each str1 once, rather than re-generating them for every str2 that we compare it with. And we compute all the geoname bigrams in advance, storing them in a dict indexed by the string so that we don't have to regenerate them for each str1. Also, we store the geoname bigrams as sets. That makes computing the hit_count much faster, since set membership testing is much faster than doing a linear scan over a list of strings. The geodict also needs to store the length of each bigram list: a set contains no duplicate items, so the set of bigrams may be smaller than the list of bigrams, but we need the list length to compute the score correctly.

# Some fake data
geonames = [
    'Slettmarkmountains Jotunheimen Norway',
    'Fairy Glen Skye Scotland UK',
    'Emigrant Wilderness California',
    'Yosemite National Park',
    'Half Dome Yosemite National Park',
]

mynames = [
    'Jotunheimen Norway',
    'Fairy Glen',
    'Slettmarkmountains Jotunheimen Norway',
    'Bryce Canyon',
    'Half Dome',
]

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

# Find the string in geonames which is the best match to str1
def best_match(str1, thresh=0.2):
    score, str2 = max((string_similarity(str1, str2), str2) for str2 in geonames)
    if score < thresh:
        str2 = None
    return score, str2

# Find the 1st string in geonames that matches str1 with a score >= thresh
def first_match(str1, thresh=0.2):
    for str2 in geonames:
        score = string_similarity(str1, str2)
        if score >= thresh:
            return score, str2
    return None

print('Best')
for mystr in mynames:
    print(mystr, ':', best_match(mystr))
print()

print('First')
for mystr in mynames:
    print(mystr, ':', first_match(mystr))
print()

# Put all the geoname bigrams into a dict
geodict = {}
for s in geonames:
    bigrams = get_bigrams(s)
    geodict[s] = (set(bigrams), len(bigrams))

def new_best_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)

    score, str2 = max((2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len), str2)
        for str2, (pairs2, pairs2_len) in geodict.items())
    if score < thresh:
        str2 = None
    return score, str2

def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)

    for str2, (pairs2, pairs2_len) in geodict.items():
        score = 2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len)
        if score >= thresh:
            return score, str2
    return None

print('New Best')
for mystr in mynames:
    print(mystr, ':', new_best_match(mystr))
print()

print('New First')
for mystr in mynames:
    print(mystr, ':', new_first_match(mystr))
print()

Output

Best
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : (0.1875, None)
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

First
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : None
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

New Best
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : (0.1875, None)
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

New First
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : None
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

new_first_match is pretty straightforward. The line

for str2, (pairs2, pairs2_len) in geodict.items():

loops over every item in geodict extracting each string, bigram set and true bigram length.

sum(x in pairs2 for x in pairs1)

counts how many of the bigrams in pairs1 are members of the pairs2 set.

So for each geoname string, we compute the similarity score and return it if it's >= the threshold, which has a default value of 0.2. You can give it a different default thresh, or pass a thresh when you call it.
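
For instance, with the toy data above:

# Require a stricter match for this one call; the default stays 0.2.
# With the toy data this still returns
# (0.41025641025641024, 'Half Dome Yosemite National Park').
print(new_first_match('Half Dome', thresh=0.4))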

new_best_match is a little more complicated. ;)

((2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len), str2)
    for str2, (pairs2, pairs2_len) in geodict.items())

is a generator expression. It loops over the geodict items and creates a (score, str2) tuple for each geoname string. We then feed that generator expression to the max function, which returns the tuple with the highest score.
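
Since Python compares tuples element by element, max over these (score, str2) pairs simply picks the entry with the highest score:

# Tuples compare element by element, so max picks the highest score.
max([(0.19, 'Bryce Canyon'), (0.41, 'Half Dome'), (0.64, 'Jotunheimen')])
# -> (0.64, 'Jotunheimen')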

Here's a version of new_first_match that implements the suggestion that juvian made in the comments. It may save a little bit of time. This version also avoids testing if either bigram is empty.

def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    if not pairs1_len:
        return None

    hiscore = 0
    for str2, (pairs2, pairs2_len) in geodict.items():
        if not pairs2_len:
            continue
        total_len = pairs1_len + pairs2_len
        bound = 2.0 * pairs1_len / total_len
        if bound >= hiscore:
            score = 2.0 * sum(x in pairs2 for x in pairs1) / total_len
            if score >= thresh:
                return score, str2
            hiscore = max(hiscore, score)
    return None

A simpler variation is to not bother computing hiscore & just compare bound to thresh.
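
That simpler variation might look like this (new_first_match_simple is just an illustrative name):

def new_first_match_simple(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    if not pairs1_len:
        return None

    for str2, (pairs2, pairs2_len) in geodict.items():
        total_len = pairs1_len + pairs2_len
        # Upper bound on the score: every bigram of str1 is a hit.
        # If even that can't reach thresh, skip the expensive sum().
        if 2.0 * pairs1_len / total_len < thresh:
            continue
        score = 2.0 * sum(x in pairs2 for x in pairs1) / total_len
        if score >= thresh:
            return score, str2
    return None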
