Getting more granular diffs from difflib (or a way to post-process a diff to achieve the same thing)


Question

I downloaded this page and made a minor edit to it, changing the first 65 in this paragraph to 68:

I then parse both sources with BeautifulSoup and diff them with difflib.

import difflib
from urllib.request import urlopen

import bs4

url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
content = urlopen(url).read()  # raw HTML of the original page

url2 = 'file:///Users/Pyderman/projects/temp/02092016062645AM-modified.html'
content2 = urlopen(url2).read()  # raw HTML of the locally edited copy

d = difflib.Differ()

soup = bs4.BeautifulSoup(content, "lxml")
soup2 = bs4.BeautifulSoup(content2, "lxml")

# Compare the visible text strings of the two documents
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)

Printing the changes gives:

- The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).  This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill).  Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.
+ The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).  This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill).  Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.

So it's printing the whole paragraph, despite the very minor change. I suppose it's a good thing that it's showing the diff by the full paragraph rather than by sentence, but can we make the output more granular somehow? As it stands, it seems if I want to highlight just the text that changed, I'll have to do some additional delta comparison of these two almost-identical strings.
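
One detail worth noting about the filter above: `difflib.Differ` also emits hint lines starting with `? ` that mark where two similar lines differ, and keeping only `-` and `+` lines discards them. A minimal sketch, using a shortened stand-in for the paragraph (the `old` and `new` strings are illustrative, not the full page text):

```python
import difflib

# Shortened stand-ins for the two almost-identical strings (illustrative only)
old = ["beneficiaries from age 65 to full retirement age (FRA)."]
new = ["beneficiaries from age 68 to full retirement age (FRA)."]

lines = list(difflib.Differ().compare(old, new))

# Lines starting with '? ' carry markers under the characters that changed
for line in lines:
    print(line.rstrip('\n'))
```

The markers are positioned by character column, so they are easiest to read on short strings; on a paragraph-length string they are harder to scan.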

Answer

You can use nltk.sent_tokenize() to split the soup strings into sentences:

# sent_tokenize needs NLTK's Punkt tokenizer models, downloaded separately with nltk.download()
from nltk import sent_tokenize

# d, soup and soup2 are the objects created in the question's code
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)

This prints only the sentence in which the change was detected:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
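
To narrow the result further, down to the words that changed, one option is to post-process a removed/added pair with `difflib.SequenceMatcher` over word lists. The following is a sketch under assumptions rather than part of the original answer: the helper name `word_level_changes` is hypothetical, and splitting on whitespace is a simplification that ignores punctuation attached to words.

```python
import difflib


def word_level_changes(old, new):
    """Return (tag, old_words, new_words) tuples for the word runs that differ."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Opcode tags are 'equal', 'replace', 'delete' or 'insert'; skip unchanged runs
        if tag != 'equal':
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


# Shortened stand-ins for the '-' and '+' sentences (illustrative only)
old = "disability beneficiaries from age 65 to full retirement age (FRA)."
new = "disability beneficiaries from age 68 to full retirement age (FRA)."

for tag, removed, added in word_level_changes(old, new):
    print(tag, removed, added)  # replace ['65'] ['68']
```

In practice the two arguments would be the text of a matching `-`/`+` pair from the diff, with the two-character prefix stripped.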
