difflib.SequenceMatcher isjunk argument not considered?
Question
In the Python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading the intended behavior?
Why does the isjunk argument seem to make no difference in this case?
difflib.SequenceMatcher(None, "AA", "A A").ratio() returns 0.8
difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8
My understanding is that if the space is ignored, the ratio should be 1.
Recommended answer
This happens because the ratio function uses the total length of both sequences when calculating the ratio, and it does not filter elements using isjunk. So, as long as the matching blocks contain the same number of matched elements with and without isjunk, the ratio will be the same.
I assume the sequences are not filtered by isjunk for performance reasons.
def ratio(self):
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    """
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))
self.a and self.b are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk function lambda x: x in ' ' is only used to determine the matching blocks. Your example is quite simple, so the matching blocks and the resulting ratio are the same for both calls:
difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]
The matching blocks are the same, and the ratio is: M = 2, T = 5 => ratio = 2.0 * 2 / 5 = 0.8
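The formula 2.0 * M / T can be checked by hand. The sketch below recomputes it from the matching blocks; manual_ratio is a hypothetical helper written for illustration, not part of difflib, and it assumes T is simply len(a) + len(b) as in the ratio source quoted earlier.

```python
import difflib


def manual_ratio(a, b, isjunk=None):
    """Recompute 2.0 * M / T from the matching blocks (illustrative helper)."""
    matcher = difflib.SequenceMatcher(isjunk, a, b)
    # M: total size of all matching blocks (the trailing sentinel has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both inputs; junk elements are still counted here.
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


print(manual_ratio("AA", "A A"))                        # no junk function
print(manual_ratio("AA", "A A", lambda x: x == " "))    # space treated as junk
```

Both calls should agree with SequenceMatcher.ratio(), since the space still contributes to T even when it is marked as junk.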
Now consider the following example:
difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]
Now the matching blocks are different, but the ratio is the same because the number of matches is still equal:
When isjunk is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6
When isjunk is lambda x: x == ' ': M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6
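One way to confirm that isjunk was registered even though the ratio did not move is to inspect the matcher's bjunk attribute, which holds the elements of b flagged as junk. This is a small sketch; is_space and match_count are names introduced here for illustration.

```python
import difflib


def is_space(ch):
    # Junk predicate equivalent to lambda x: x == ' '
    return ch == " "


def match_count(matcher):
    # M: sum of the sizes of all matching blocks
    return sum(block.size for block in matcher.get_matching_blocks())


plain = difflib.SequenceMatcher(None, "AA ", "A A")
junked = difflib.SequenceMatcher(is_space, "AA ", "A A")

print(plain.bjunk, junked.bjunk)      # junk elements seen in b, per matcher
print(plain.get_matching_blocks())    # blocks found without a junk function
print(junked.get_matching_blocks())   # blocks found with the space as junk
print(match_count(plain), match_count(junked))
```

The blocks differ between the two matchers, but the match counts are equal, which is why ratio() reports the same value for both.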
Finally, an example where the number of matches differs:
difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]
The number of matches differs, and so does the ratio:
When isjunk is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7
When isjunk is lambda x: x == ' ': M = 1 + 2, T = 7 => ratio = 2.0 * 3 / 7
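If the goal is the ratio of 1 that the question expected, one possible workaround (not part of the original answer) is to drop the junk elements from both inputs before comparing, so they no longer count toward T. ratio_ignoring is a hypothetical helper, sketched under the assumption that discarding the junk entirely is acceptable for your use case.

```python
import difflib


def ratio_ignoring(a, b, isjunk):
    """Illustrative helper: remove junk elements from both inputs, then compare."""
    a_kept = [x for x in a if not isjunk(x)]
    b_kept = [x for x in b if not isjunk(x)]
    # SequenceMatcher accepts any sequences of hashable elements, including lists.
    return difflib.SequenceMatcher(None, a_kept, b_kept).ratio()


print(ratio_ignoring("AA", "A A", lambda x: x == " "))
```

With the spaces removed, both inputs reduce to the same two elements, so M = 2 and T = 4.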