Python's difflib SequenceMatcher speed up


Question

I'm using difflib's SequenceMatcher (the ratio() method) to measure similarity between text files. difflib is reasonably fast on a small set of files: comparing 10 files of about 70 KB each against one another (46 comparisons) takes about 80 seconds.

The issue is that I have a collection of 3000 txt files (75 KB on average), and a rough estimate of how long SequenceMatcher would need to complete the comparison job is 80 days!

I tried the real_quick_ratio() and quick_ratio() methods, but they don't fit our needs.

Is there any way to speed up the comparison process? If not, is there a faster method for this kind of task, even if it is not in Python?

Recommended answer

The issue you're running into is very common, since difflib is not optimized for speed. Here are some tricks I've found over the years while developing a tool that compares HTML documents.

Create two lists containing the lines of each file, then pass the lists to difflib.SequenceMatcher. SequenceMatcher knows how to handle lists, and the comparison is much faster because it works line by line instead of character by character. This might reduce the precision.
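A minimal sketch of the line-by-line approach, using only the standard library. The helper name line_ratio and the sample strings are made up for illustration; they are not taken from the answer's own fuzzy_string_cmp.py or diff.py.

```python
import difflib


def line_ratio(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1], computed over lines instead of characters."""
    # Splitting into lines makes each sequence element a whole line,
    # so SequenceMatcher has far fewer elements to compare.
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    return matcher.ratio()


# Hypothetical sample data: four lines each, one line differs.
a = "alpha\nbeta\ngamma\ndelta\n"
b = "alpha\nbeta\nGAMMA\ndelta\n"
print(line_ratio(a, b))  # 3 of 4 lines match on each side: 2 * 3 / 8 = 0.75
```

The loss of precision mentioned above is visible here: a line that differs by a single character counts as entirely different.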

Take a look at fuzzy_string_cmp.py and diff.py to see exactly how I'm doing this.

There is a great library called diff_match_patch, available on PyPI. It performs fast diffs between two strings and returns the changes (line added, line equal, line removed).

By leveraging diff_match_patch you should be able to create your own dmp_quick_ratio function.
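One possible shape for such a function is sketched below. It is not the implementation from the answer's diff.py, which is not shown here. It assumes the diff-match-patch package from PyPI is installed and that diff_main returns a list of (op, text) pairs with op equal to -1 for a deletion, 0 for equal text and 1 for an insertion. The helper ratio_from_diffs is a hypothetical name, and the formula simply mirrors the 2 * matches / total definition used by SequenceMatcher.ratio(), so the values may not agree exactly with difflib because the two diff algorithms differ.

```python
def ratio_from_diffs(diffs):
    """Turn a list of (op, text) pairs into a difflib-style ratio.

    op is -1 (delete), 0 (equal) or 1 (insert).
    """
    equal = sum(len(text) for op, text in diffs if op == 0)
    len_a = sum(len(text) for op, text in diffs if op <= 0)  # equal + deleted
    len_b = sum(len(text) for op, text in diffs if op >= 0)  # equal + inserted
    total = len_a + len_b
    # Two empty inputs count as identical, as in difflib.
    return 1.0 if total == 0 else 2.0 * equal / total


def dmp_quick_ratio(text_a, text_b):
    """Similarity of two strings based on a diff_match_patch diff."""
    # Imported lazily so ratio_from_diffs stays usable without the package.
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    diffs = dmp.diff_main(text_a, text_b)
    return ratio_from_diffs(diffs)


# Hand-written diff for "abcd" -> "abce": keep "abc", delete "d", insert "e".
print(ratio_from_diffs([(0, "abc"), (-1, "d"), (1, "e")]))  # 2 * 3 / 8 = 0.75
```

Keeping the arithmetic separate from the library call makes it easy to check the ratio on small hand-written diffs before running it on real files.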

In diff.py you can see how I'm using the library, which may give you inspiration for creating dmp_quick_ratio.

My tests showed that using diff_match_patch was 20 times faster than Python's difflib.
