Is there an alternative to `difflib.get_close_matches()` that returns indexes (list positions) instead of a list of strings?
Problem description
I want to use something like `difflib.get_close_matches`, but instead of the most similar strings, I would like to obtain their indexes (i.e. their positions in the list).
List indexes are more flexible, because an index can be used to look up other data structures that are related to the matched string.
For example, instead of:
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']
I would like:
>>> difflib.get_close_matches('Hello', words)
[0, 1, 6]
There doesn't seem to be a parameter that produces this result. Is there an alternative to `difflib.get_close_matches()` that returns the indexes?
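For comparison, here is a minimal sketch of the "search the strings back" workaround, assuming it is acceptable to report only the first position of a string that occurs more than once. The `first_index` dictionary is a name introduced here for illustration; it is not part of difflib.

```python
import difflib

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']

# Map each string to the first position where it occurs in the list.
# Assumption: duplicates, if any, should resolve to their first occurrence.
first_index = {}
for position, candidate in enumerate(words):
    first_index.setdefault(candidate, position)

# Ask difflib for the matching strings, then translate them back to positions.
matches = difflib.get_close_matches('Hello', words)
indexes = [first_index[match] for match in matches]
print(indexes)
```

This needs an extra pass over the list and loses information when the list contains duplicates, which is why a function that returns the indexes directly is preferable.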
I know I could use `difflib.SequenceMatcher` and compare the strings one by one with `ratio` (or `quick_ratio`). However, I am afraid that this would be very inefficient, because:
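- I would have to create thousands of `SequenceMatcher` objects and compare them (I expected `get_close_matches` to avoid using that class).
  EDIT: No. I checked the source code of `get_close_matches`, and it actually uses `SequenceMatcher`.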
- There is no cutoff (I am guessing there is an optimization that avoids calculating the ratio for every string).
  EDIT: Partially false. The code of `get_close_matches` does not have any major optimizations, except that it uses `real_quick_ratio`, `quick_ratio` and `ratio` together. In any case, I can easily copy that optimization into my own function. I also hadn't considered that `SequenceMatcher` has methods to set the sequences, `set_seq1` and `set_seq2`, so at least I won't have to create a new object each time.
- As far as I understand, all Python libraries are compiled C, which would improve performance.
  EDIT: I am quite sure this is the case. The function is in the folder called cpython.
  EDIT: There is a small difference (p-value 0.030198) between calling the function directly from difflib and calling a copy of it placed in a file mydifflib.py.
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956]
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]
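To put the two runs side by side, the following sketch summarizes them with the standard `statistics` module. The numbers are copied from the output above; the variable names are introduced here for illustration, and the summary does not reproduce the statistical test behind the quoted p-value, which the post does not name.

```python
from statistics import mean, stdev

# Seconds per 100000 calls, copied from the two timeit.repeat runs above.
difflib_runs = [13.230449825001415, 13.126462900007027, 12.965455356999882,
                12.955717618009658, 13.066136312991148, 12.935014379996574,
                13.082025538009475, 12.943519036009093, 13.149949093989562,
                12.970130036002956]
mydifflib_runs = [13.363269686000422, 13.087718107010005, 13.112324478992377,
                  13.358293497993145, 13.283965317998081, 13.056695280989516,
                  13.021098569995956, 13.04310674899898, 13.024205000008806,
                  13.152750282009947]

difflib_mean = mean(difflib_runs)
mydifflib_mean = mean(mydifflib_runs)

# Gap between the copied function and the original, relative to the original.
relative_gap = (mydifflib_mean - difflib_mean) / difflib_mean

print(f"difflib:   mean {difflib_mean:.3f} s, stdev {stdev(difflib_runs):.3f} s")
print(f"mydifflib: mean {mydifflib_mean:.3f} s, stdev {stdev(mydifflib_runs):.3f} s")
print(f"relative gap: {relative_gap:.2%}")
```

The relative gap between the two means comes out below one percent, so the copy is only marginally slower than the original on this sample.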
Nevertheless, it is not nearly as bad as I expected. I think I will proceed unless anybody knows of another library or alternative.
Recommended answer
I took the source code of `get_close_matches` and modified it to return the indexes instead of the string values.
# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest


def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best
    "good enough" matches. word is a sequence for which close matches
    are desired (typically a string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

    Optional arg n (default 3) is the maximum number of close matches to
    return. n must be > 0.

    Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
    that don't score at least that similar to word are ignored.
    """
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]
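The chained check in the function relies on `real_quick_ratio()` and `quick_ratio()` being cheap upper bounds on `ratio()`, so a candidate that fails a cheap check can be skipped without running the expensive one. A small sketch to observe this on a few strings (the `candidates` list and the `bounds` variable are illustrative choices, not part of the answer):

```python
from difflib import SequenceMatcher

word = "hello"
candidates = ["hello", "Hallo", "hi", "screen"]

matcher = SequenceMatcher()
matcher.set_seq2(word)  # the fixed word is set once and reused for every candidate

bounds = []
for candidate in candidates:
    matcher.set_seq1(candidate)
    cheapest = matcher.real_quick_ratio()  # based only on the two lengths
    cheap = matcher.quick_ratio()          # based on character counts, ignoring order
    exact = matcher.ratio()                # full matching-block computation
    bounds.append((candidate, cheapest, cheap, exact))
    print(f"{candidate!r}: real_quick={cheapest:.3f} quick={cheap:.3f} ratio={exact:.3f}")
```

For every candidate the three values are non-increasing from left to right, which is what makes it safe to test them in that order against the cutoff.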
Usage
>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 6, 1]
Now I can relate these indexes to the data associated with each string without having to search for the strings again.
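As an illustration of that last point, here is a self-contained sketch that uses the returned indexes to read from a parallel list. `close_match_indexes` is a condensed stand-in for the function above (it skips the quick-ratio pre-checks), and the `languages` list is hypothetical data invented for the example.

```python
from difflib import SequenceMatcher
from heapq import nlargest


def close_match_indexes(word, possibilities, n=3, cutoff=0.6):
    """Condensed stand-in for get_close_matches_indexes (illustrative only)."""
    matcher = SequenceMatcher()
    matcher.set_seq2(word)
    scored = []
    for idx, candidate in enumerate(possibilities):
        matcher.set_seq1(candidate)
        score = matcher.ratio()
        if score >= cutoff:
            scored.append((score, idx))
    # Best scores first; keep only the positions.
    return [idx for _, idx in nlargest(n, scored)]


words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
# Hypothetical parallel list: languages[i] describes words[i].
languages = ['English', 'German', 'English', 'English', 'English',
             'English', 'German', 'English', 'English']

indexes = close_match_indexes('hello', words)
for idx in indexes:
    print(idx, words[idx], languages[idx])
```

Because the result is a list of positions, the same indexes work for any number of parallel lists, and duplicate strings in `words` remain distinguishable.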