Implementing Google's DiffMatchPatch API for Python 2/3


Question

I want to write a simple diff application in Python using Google's Diff Match Patch APIs. I'm quite new to Python, so I would like an example of how to use the Diff Match Patch API to semantically compare two paragraphs of text. I'm not sure how to use the diff_match_patch.py file or what to import from it. Help will be much appreciated!

Additionally, I've tried using difflib, but I found it ineffective for comparing sentences that differ widely. I'm using Ubuntu 12.04 x64.

Recommended answer

Google's diff-match-patch API is the same in every language it is implemented in (Java, JavaScript, Dart, C++, C#, Objective C, Lua, and Python 2.x or Python 3.x). You can therefore typically use sample snippets written in a language other than your target language to work out which API calls are needed for a given diff/match/patch task.

For a simple "semantic" comparison, this is what you need:

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

# Create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()

# Depending on the overall length and complexity of the text you work with,
# you may want to extend (or, as here, disable) the Diff_Timeout setting
dmp.Diff_Timeout = 0   # 0 means no timeout; the default is 1.0 second

# All 'diff' jobs start with invoking diff_main()
diffs = dmp.diff_main(textA, textB)

# diff_cleanupSemantic() is used to make the diffs array more "human" readable
dmp.diff_cleanupSemantic(diffs)

# And if you want the results as a ready-to-display HTML snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)
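
For a command-line diff tool, HTML is often less convenient than plain text. The sketch below is one possible way to print the diffs list inline; it is not part of the diff-match-patch API. `render_inline` is a hypothetical helper, the `[-...-]` and `{+...+}` markers are an arbitrary choice, and the sample list is hard-coded (it is the cleaned-up result for the textA and textB above) so that the sketch runs without the library.

```python
# Operation codes used in the (op, text) tuples returned by diff_main().
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def render_inline(diffs):
    """Render (op, text) tuples as a single string with inline markers.

    Hypothetical helper: deletions become [-text-], insertions {+text+},
    and unchanged text is copied through as-is.
    """
    parts = []
    for op, text in diffs:
        if op == DIFF_DELETE:
            parts.append("[-" + text + "-]")
        elif op == DIFF_INSERT:
            parts.append("{+" + text + "+}")
        else:
            parts.append(text)
    return "".join(parts)


# Hard-coded sample: the diffs list after diff_cleanupSemantic().
diffs = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]

print(render_inline(diffs))
```

In a real script, the `diffs` variable would come from `dmp.diff_main()` followed by `dmp.diff_cleanupSemantic()` rather than from a literal.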



A word on "semantic" processing by diff-match-patch

Such processing is useful for presenting the differences to a human reader, because it tends to produce a shorter list of differences by avoiding irrelevant resynchronization of the texts (for example, when two distinct words happen to share letters in the middle). The results are far from perfect, however: this processing relies on simple heuristics based on the length of the differences, surface patterns and so on, rather than on actual NLP processing based on lexicons and other semantic-level devices.

For example, the textA and textB values used above produce the following diffs arrays, before and after diff_cleanupSemantic():

[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]
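
Each entry in these lists is an (op, text) tuple, where -1 marks a deletion, 1 an insertion and 0 unchanged text. A quick way to convince yourself that both lists describe the same change is to rebuild the two input strings from each of them. The sketch below does this in plain Python; `rebuild_texts` is a hypothetical helper written for illustration (the library's own `diff_text1()` and `diff_text2()` methods serve the same purpose).

```python
def rebuild_texts(diffs):
    """Reconstruct (source, destination) strings from (op, text) tuples.

    The source text is everything except insertions (op == 1);
    the destination text is everything except deletions (op == -1).
    """
    source = "".join(text for op, text in diffs if op != 1)
    destination = "".join(text for op, text in diffs if op != -1)
    return source, destination


# The two lists shown above, before and after diff_cleanupSemantic().
raw = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
       (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
cleaned = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
           (-1, 'red'), (1, 'blue'), (0, ' hat')]

print(rebuild_texts(raw))
print(rebuild_texts(cleaned))
```

Both calls yield the original textA and textB, which shows that the cleanup only regroups the edits and never changes what the diff means.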

Nice! The letter 'e' that is common to 'red' and 'blue' causes diff_main() to split this area of the text into four pieces, but diff_cleanupSemantic() merges them into just two edits, nicely singling out the differing words 'red' and 'blue'.

However, if we have, for example:

textA = "stackoverflow is cool"
textB = "so is very cool"

The before/after arrays produced are:

[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]

This shows that the allegedly semantically improved "after" list can look rather "tortured" compared with the "before" list. Note, for example, how the leading 's' is kept as a match and how the added word 'very' is mixed with parts of the 'is cool' expression. Ideally, we would probably expect something like:

[(-1, 'stackoverflow'), (1, 'so'), (0, ' is'), (1, ' very'), (0, ' cool')]
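
One way to get closer to that kind of word-aligned result is to compare word tokens instead of characters. The sketch below is a swapped-in technique, not what diff-match-patch does internally: it uses the standard library's `difflib.SequenceMatcher` on lists of words and emits tuples in the same (op, text) shape. `word_diff` is a hypothetical helper, and splitting on whitespace is a simplifying assumption (the original spacing between words is not preserved).

```python
import difflib


def word_diff(text_a, text_b):
    """Word-level diff returning (op, text) tuples: -1 delete, 0 equal, 1 insert.

    Assumption: words are whitespace-separated; runs of words are rejoined
    with single spaces, so the exact original spacing is lost.
    """
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            diffs.append((0, " ".join(words_a[a1:a2])))
            continue
        # 'replace' is emitted as a deletion followed by an insertion.
        if a2 > a1:
            diffs.append((-1, " ".join(words_a[a1:a2])))
        if b2 > b1:
            diffs.append((1, " ".join(words_b[b1:b2])))
    return diffs


print(word_diff("stackoverflow is cool", "so is very cool"))
```

The trade-off is granularity: a word-level diff never splits a word such as 'stackoverflow' into fragments, but it also cannot report a one-letter change inside a word.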
