Checking fuzzy/approximate substring existing in a longer string, in Python?


Problem description


Using algorithms such as Levenshtein distance (or difflib), it is easy to find approximate matches. For example:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

Fuzzy matches can then be detected by choosing a suitable threshold for this ratio.
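As a minimal sketch of the threshold idea, the ratio from difflib can be wrapped in a small predicate. The helper name `is_fuzzy_match` and the default threshold of 0.8 are assumptions chosen for illustration, not part of difflib.

```python
import difflib


def is_fuzzy_match(a, b, threshold=0.8):
    """Return True when the difflib similarity ratio of a and b reaches threshold."""
    # is_fuzzy_match and the 0.8 default are illustrative choices, not a library API.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


print(is_fuzzy_match("amazing", "amaging"))       # True (ratio is about 0.857)
print(is_fuzzy_match("amazing", "amaging", 0.9))  # False
```

Raising the threshold makes the check stricter; the right value depends on how much noise the data is expected to contain.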

Current requirement: find fuzzy substrings of a larger string, based on a threshold.

For example:

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string

One brute-force solution is to generate all substrings of length N-1 to N+1 (or some other range of matching lengths), where N is the length of query_string, run Levenshtein on each of them in turn, and compare the result against the threshold.
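The brute-force idea described above can be sketched roughly as follows. This version scores each window with difflib's ratio rather than a Levenshtein function, so that it only needs the standard library; the function name `fuzzy_substrings`, the `slack` parameter, and the rule of keeping the best-scoring non-overlapping windows are assumptions made for this illustration.

```python
import difflib


def fuzzy_substrings(query, text, threshold=0.8, slack=1):
    """Brute-force search for windows of text that resemble query.

    Returns a list of (start_index, substring, ratio) tuples, sorted by index.
    """
    n = len(query)
    candidates = []
    # Try every window whose length is within `slack` of the query length.
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(len(text) - length + 1):
            window = text[start:start + length]
            score = difflib.SequenceMatcher(None, query, window).ratio()
            if score >= threshold:
                candidates.append((start, window, score))

    # Many overlapping windows pass the threshold; keep the best-scoring
    # one from each overlapping group (an illustrative policy).
    candidates.sort(key=lambda c: (-c[2], c[0]))
    chosen = []
    for start, window, score in candidates:
        end = start + len(window)
        if all(end <= s or start >= s + len(w) for s, w, _ in chosen):
            chosen.append((start, window, score))
    return sorted(chosen)


large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
for start, window, score in fuzzy_substrings(query_string, large_string):
    print(start, window, round(score, 3))
# finds "manhatan" at index 8 and "manhattin" at index 49
```

The cost grows with the text length times the number of window lengths times the cost of one comparison, which is why a dedicated fuzzy matcher is attractive for long inputs.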

Is there a better solution available in Python, preferably a module included with Python 2.7, or else an externally available module?

---------------------UPDATE AND SOLUTION ----------------

The Python regex module works pretty well, though for fuzzy substring searches it is a little slower than the built-in re module, which is to be expected given the extra work involved. The output is what I wanted, and the degree of fuzziness is easy to control.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>

Solution

The third-party regex library, which has been proposed as an eventual replacement for re, includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive. For example, this would give you a match with at most one substitution, insertion, or deletion:

import regex
regex.match('(amazing){e<=1}', 'amaging')
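To make the error budget in `{e<=1}` more concrete, here is a minimal pure-Python sketch of Levenshtein distance, which counts the substitutions, insertions, and deletions needed to turn one string into another. It is only an illustration of what is being counted, not a description of how the regex module implements fuzzy matching internally, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("amazing", "amaging"))      # 1
print(levenshtein("manhattan", "manhatan"))   # 1
print(levenshtein("manhattan", "manhattin"))  # 1
```

Each of these pairs is one edit apart, so each would fit within a budget of one error.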
