Python regex module fuzzy match: substitution count not as expected
Problem description
The Python module regex allows fuzzy matching.
You can specify the allowable number of substitutions (s), insertions (i), deletions (d), and total errors (e).
The fuzzy_counts attribute of a match result returns a tuple of three counts, such as (0, 0, 0), where:
match.fuzzy_counts[0] = count for 's'
match.fuzzy_counts[1] = count for 'i'
match.fuzzy_counts[2] = count for 'd'
Problem
The deletions and insertions are counted as expected, but the substitutions are not.
In the example below, the only change is a single character deleted from the query, yet the substitution count is 6 (7 if the BESTMATCH option is removed).
How is the substitution count calculated?
I would be grateful if anyone could explain to me how this works.
>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(6,0,1)
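As a sanity check on the expectation that a single deletion explains the difference, here is a minimal sketch of a plain edit-distance alignment between the pattern and the query. It does not use the regex module and is not how regex computes its costs internally. The helper names parse_pattern and edit_counts are hypothetical, the parser only understands literal characters and simple [..] classes, and the whole query is aligned against the whole pattern. It follows the same convention as fuzzy_counts: a deletion is a pattern position with no character in the text, an insertion is an extra character in the text.

```python
def parse_pattern(pattern):
    """Split a pattern of literals and simple [..] classes into a list of sets.

    Hypothetical helper: it does not handle ranges, negation, or escapes.
    """
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            positions.append(set(pattern[i + 1:end]))
            i = end + 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def edit_counts(pattern, text):
    """Return (substitutions, insertions, deletions) of a minimal-cost alignment."""
    positions = parse_pattern(pattern)
    # Each cell holds (total, subs, ins, dels); tuples compare by total first.
    # Row 0: an empty pattern against text[:j] needs j insertions.
    prev = [(j, 0, j, 0) for j in range(len(text) + 1)]
    for i, allowed in enumerate(positions, start=1):
        # Column 0: i pattern positions against empty text need i deletions.
        cur = [(i, 0, 0, i)]
        for j, ch in enumerate(text, start=1):
            t, s, n, d = prev[j - 1]
            # Diagonal step: exact match, or a substitution.
            diag = (t, s, n, d) if ch in allowed else (t + 1, s + 1, n, d)
            t, s, n, d = prev[j]
            delete = (t + 1, s, n, d + 1)  # pattern position missing from text
            t, s, n, d = cur[j - 1]
            insert = (t + 1, s, n + 1, d)  # extra character in text
            cur.append(min(diag, delete, insert))
        prev = cur
    _, subs, ins, dels = prev[-1]
    return (subs, ins, dels)


pattern = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
query = "TATGGACCAAAGTCTCAAGCCATGTG"
print(edit_counts(pattern, query))  # (0, 0, 1)
```

Under these assumptions the cheapest alignment is a single deletion and no substitutions, which is the result the question expects from fuzzy_counts.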
Recommended answer
This was caused by what looks like a bug in the regex module's cost calculations. It was still present in regex version 2015.10.05, but was fixed in the next version, 2015.10.22, as shown below:
$ sudo pip3 install regex==2015.10.05
Processing /root/.cache/pip/wheels/24/cb/ae/9653e30c8f801544a645e17d26fa6803aeaf76ad0482663c27/regex-2015.10.5-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Successfully installed regex-2015.10.5
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(5, 0, 1)
$ sudo pip3 install regex==2015.10.22
Processing /root/.cache/pip/wheels/60/f6/9a/23e723633e62a79064cb301c54a3b50482b8c690f86c9983ee/regex-2015.10.22-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Found existing installation: regex 2015.10.5
Uninstalling regex-2015.10.5:
Successfully uninstalled regex-2015.10.5
Successfully installed regex-2015.10.22
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(0, 0, 1)
Given these dates, I infer that the commit that fixed the bug was https://bitbucket.org/mrabarnett/mrab-regex/commits/296c1daf86619039c6fe55868e7d861097d01aae, with the description:
Hg issue 161: Unexpected fuzzy match results
Fixed the bug and did some related tidying up.
The bug referred to is https://bitbucket.org/mrabarnett/mrab-regex/issues/161.
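To check whether a given environment predates the fix, something like the following sketch may help. The names FIXED_IN, version_tuple, predates_fix and installed_regex_version are hypothetical, and it assumes regex version strings are dot-separated integers, as in the releases shown above. Only 2015.10.05 (affected) and 2015.10.22 (fixed) were actually tested here; treating every earlier release as affected is an assumption.

```python
from importlib import metadata

# First release observed above to return the expected counts.
FIXED_IN = (2015, 10, 22)


def version_tuple(version):
    """Turn '2015.10.05' into (2015, 10, 5). Assumes dot-separated integers."""
    return tuple(int(part) for part in version.split("."))


def predates_fix(version):
    """True if the version is older than the release containing the fix."""
    return version_tuple(version) < FIXED_IN


def installed_regex_version():
    """Return the installed regex version string, or None if it is not installed."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


installed = installed_regex_version()
if installed is None:
    print("regex is not installed")
elif predates_fix(installed):
    print(f"regex {installed} predates the fix; fuzzy_counts may be wrong")
else:
    print(f"regex {installed} includes the fix")
```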