Is there an alternative to `difflib.get_close_matches()` that returns indexes (list positions) instead of a str list?


Question

I want to use something like difflib.get_close_matches, but instead of the most similar strings I would like to obtain their indexes (i.e. their positions in the list).

List indexes are more flexible because an index can be related to other data structures (associated with the matched string).

For example, instead of:

>>> import difflib
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

I would like:

>>> difflib.get_close_matches('Hello', words)
[0, 1, 6] 

There doesn't seem to be a parameter that produces this result. Is there an alternative to difflib.get_close_matches() that returns the indexes?

I know I could use difflib.SequenceMatcher and compare the strings one by one with ratio (or quick_ratio). However, I am afraid that this would be very inefficient, because:

  1. I would have to create thousands of SequenceMatcher objects and compare them (I was expecting get_close_matches to avoid using the class).

     EDIT: No. I checked the source code of get_close_matches, and it actually uses SequenceMatcher.

  2. There is no cutoff (I am guessing there is an optimization that avoids calculating the ratio for every string).

     EDIT: Partially false. get_close_matches does not have any major optimizations, except that it uses real_quick_ratio, quick_ratio and ratio together. In any case, I can easily copy that optimization into my own function. I also had not considered that SequenceMatcher has methods to set the sequences, set_seq1 and set_seq2, so at least I won't have to create an object each time.

  3. As far as I understand, all Python libraries are C compiled, and this would increase performance.

     EDIT: I am quite sure this is the case. The function is in the folder called cpython.
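The assumption about C-compiled libraries can be checked from Python itself. The sketch below asks the inspect module where difflib is defined and whether get_close_matches is an ordinary Python function. In a standard CPython installation difflib ships as a pure-Python module (Lib/difflib.py), so a C speed-up should not be expected from calling it directly; other distributions or frozen builds may report a different path.

```python
import difflib
import inspect

# Path of the file that defines the module; a pure-Python module ends in ".py".
source_path = inspect.getsourcefile(difflib)
print(source_path)

# True for functions written in Python, False for C builtins.
is_python_function = inspect.isfunction(difflib.get_close_matches)
print(is_python_function)
```

If the path ends in difflib.py, a copy of the function in your own module should perform about the same as the original.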

EDIT: There is a small difference (p-value 0.030198) between calling the function directly from difflib and calling a copy of it placed in a file mydifflib.py:

ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956]

ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]

Nevertheless, it is not nearly as bad as I expected. I think I will proceed unless anybody knows of another library or alternative.

Recommended answer

I took the source code of get_close_matches and modified it to return the indexes instead of the string values.

# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest

def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best 
    "good enough" matches. word is a sequence for which close matches 
    are desired (typically a string).
    possibilities is a list of sequences against which to match word
    (typically a list of strings).
    Optional arg n (default 3) is the maximum number of close matches to
    return.  n must be > 0.
    Optional arg cutoff (default 0.6) is a float in [0, 1].  Possibilities
    that don't score at least that similar to word are ignored.
    """

    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]
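The three-step test in the loop above is safe because, per the difflib documentation, quick_ratio() and real_quick_ratio() each return an upper bound on ratio() and are cheaper to compute, so a candidate that fails a cheap bound can be skipped without running the full comparison. The following sketch (the word and candidate list are illustrative, not from the original post) prints all three values for a few candidates so the ordering can be seen directly.

```python
from difflib import SequenceMatcher

word = "hello"
candidates = ["hello", "Hallo", "hi", "screen"]  # illustrative sample

s = SequenceMatcher()
# SequenceMatcher caches details about seq2, so set the common string once.
s.set_seq2(word)

bounds = []
for idx, candidate in enumerate(candidates):
    s.set_seq1(candidate)
    # Cheapest bound first, exact ratio last.
    bounds.append((idx, s.real_quick_ratio(), s.quick_ratio(), s.ratio()))

for idx, rqr, qr, r in bounds:
    print(idx, candidates[idx], round(rqr, 3), round(qr, 3), round(r, 3))
```

Each printed row should satisfy real_quick_ratio >= quick_ratio >= ratio, which is why the function can check the cutoff in that order.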

Usage

>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 6, 1]

Now I can relate these indexes to the data associated with each string without having to search for the strings again.
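As a minimal sketch of that last step, the example below looks up entries in a second list that runs parallel to words. The helper close_match_indexes is a condensed, hypothetical stand-in for get_close_matches_indexes (it skips the quick-ratio pre-checks and argument validation), and the languages list is invented purely for illustration.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def close_match_indexes(word, possibilities, n=3, cutoff=0.6):
    """Condensed, hypothetical stand-in for get_close_matches_indexes."""
    s = SequenceMatcher()
    s.set_seq2(word)
    scored = []
    for idx, candidate in enumerate(possibilities):
        s.set_seq1(candidate)
        score = s.ratio()
        if score >= cutoff:
            scored.append((score, idx))
    # Best scores first; ties are broken by the larger index.
    return [idx for score, idx in nlargest(n, scored)]


words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
# Hypothetical parallel data: languages[i] describes words[i].
languages = ['English', 'German', 'English', 'English', 'English',
             'English', 'German', 'English', 'English']

matches = close_match_indexes('hello', words)
labelled = [(words[i], languages[i]) for i in matches]
print(labelled)
```

Because the result is a list of positions, the same indexes work for any number of parallel lists, and duplicate strings in words remain distinguishable.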
