Match changes by words, not by characters
Question
I'm using difflib's SequenceMatcher to get_opcodes() and then highlight the changes with CSS to create some kind of web diff.
First, I set a min_delta so that I consider two strings different only if 3 or more characters in the whole string differ, taken one after another (delta is the real, encountered delta, which sums up all one-character changes):
from difflib import SequenceMatcher

# The first positional argument of SequenceMatcher is isjunk, so pass None explicitly.
matcher = SequenceMatcher(None, source_str, diff_str)
min_delta = 3
delta = 0
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        continue  # nothing to capture here
    elif tag == "delete":
        if source_str[i1:i2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # delete i2-i1 chars
    elif tag == "replace":
        if source_str[i1:i2].isspace() or diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # replace i2-i1 chars
    elif tag == "insert":
        if diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (j2 - j1)  # insert j2-j1 chars
# "3 or more" changed characters means the strings differ
return_value = delta >= min_delta
This helps me determine whether two strings really differ. It's not very efficient, but I couldn't think of anything better.
Then, I colorize the differences between the two strings in the same way:
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        # bustling with strings, inserting them in <span>s and colorizing
        pass
    elif tag == "delete":
        # ...
        pass
return_value = old_string, new_string
And the result looks pretty ugly (blue for replaced, green for new, red for deleted, and no color for equal):
So, this is happening because SequenceMatcher matches every single character. But I want it to match every single word instead (and probably the whitespace around the words), or something even easier on the eye, because as you can see in the screenshot, the first book has actually been moved to the fourth position.
It seems to me that something could be done with the isjunk and autojunk parameters of SequenceMatcher, but I can't figure out how to write lambdas for my purposes.
Thus, I have two questions:
- Is it possible to match by words? Is it possible to do this using get_opcodes() and SequenceMatcher? If not, what could be used instead?
- Okay, this is rather a corollary, but nevertheless: if matching by words is possible, then I can get rid of the dirty hacks with min_delta and return True as soon as at least one word differs, right?
Recommended answer
SequenceMatcher can accept lists of str as input.
You can first split the input into words, and then use SequenceMatcher to diff the words. Your colored diff will then be by words instead of by characters.
>>> from difflib import SequenceMatcher
>>> def my_get_opcodes(a, b):
...     s = SequenceMatcher(None, a, b)
...     for tag, i1, i2, j1, j2 in s.get_opcodes():
...         print('{:7} a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...             tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
...
>>> my_get_opcodes("qabxcd", "abycdf")
delete  a[0:1] --> b[0:0]      'q' --> ''
equal   a[1:3] --> b[0:2]     'ab' --> 'ab'
replace a[3:4] --> b[2:3]      'x' --> 'y'
equal   a[4:6] --> b[3:5]     'cd' --> 'cd'
insert  a[6:6] --> b[5:6]       '' --> 'f'
# This is the bad result you currently have.
>>> my_get_opcodes("one two three\n", "ore tree emu\n")
equal   a[0:1] --> b[0:1]      'o' --> 'o'
replace a[1:2] --> b[1:2]      'n' --> 'r'
equal   a[2:5] --> b[2:5]    'e t' --> 'e t'
delete  a[5:10] --> b[5:5]  'wo th' --> ''
equal   a[10:13] --> b[5:8]    'ree' --> 'ree'
insert  a[13:13] --> b[8:12]       '' --> ' emu'
equal   a[13:14] --> b[12:13]     '\n' --> '\n'
>>> my_get_opcodes("one two three\n".split(), "ore tree emu\n".split())
replace a[0:3] --> b[0:3] ['one', 'two', 'three'] --> ['ore', 'tree', 'emu']
# This may be the result you want.
>>> my_get_opcodes("one two emily three ha\n".split(), "ore tree emily emu haha\n".split())
replace a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal   a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace a[3:5] --> b[3:5] ['three', 'ha'] --> ['emu', 'haha']
# A more complicated example exhibiting all four kinds of opcodes.
>>> my_get_opcodes("one two emily three yo right end\n".split(), "ore tree emily emu haha yo yes right\n".split())
replace a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal   a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace a[3:4] --> b[3:5] ['three'] --> ['emu', 'haha']
equal   a[4:5] --> b[5:6]   ['yo'] --> ['yo']
insert  a[5:5] --> b[6:7]       [] --> ['yes']
equal   a[5:6] --> b[7:8] ['right'] --> ['right']
delete  a[6:7] --> b[8:8]  ['end'] --> []
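To tie the word-level opcodes back to the colored web diff from the question, here is a minimal sketch, not a definitive implementation. The function names highlight_words and words_differ and the CSS class names del, ins and rep are made up for illustration. The sketch also assumes it is acceptable to normalize all whitespace to single spaces, since str.split() discards the original spacing.

```python
from difflib import SequenceMatcher
from html import escape


def highlight_words(old, new):
    """Return (old_html, new_html) with changed words wrapped in <span>s."""
    a, b = old.split(), new.split()
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        old_chunk = escape(" ".join(a[i1:i2]))
        new_chunk = escape(" ".join(b[j1:j2]))
        if tag == "equal":
            old_out.append(old_chunk)
            new_out.append(new_chunk)
            continue
        # Hypothetical CSS class names; adapt them to your stylesheet.
        if tag in ("delete", "replace"):
            css = "del" if tag == "delete" else "rep"
            old_out.append('<span class="{}">{}</span>'.format(css, old_chunk))
        if tag in ("insert", "replace"):
            css = "ins" if tag == "insert" else "rep"
            new_out.append('<span class="{}">{}</span>'.format(css, new_chunk))
    return " ".join(old_out), " ".join(new_out)


def words_differ(old, new):
    # Comparing the word lists ignores whitespace differences,
    # so no min_delta threshold is needed.
    return old.split() != new.split()


old_html, new_html = highlight_words("one two emily", "one tree emily")
print(old_html)  # one <span class="rep">two</span> emily
print(new_html)  # one <span class="rep">tree</span> emily
```

With word lists as input, the corollary from the question holds in this sketch: two texts differ as soon as at least one word differs, which words_differ checks directly.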
You can also diff by line, by book, or by segment. You only need to prepare a function that preprocesses the whole passage string into the desired diff chunks.
For example:
- To diff by line: you could probably use splitlines().
- To diff by book: you could probably implement a function that strips off the 1., 2. prefixes.
- To diff by segment: you could pass the input to the API like this: ([book_1, author_1, year_1, book_2, author_2, ...], [book_1, author_1, year_1, book_2, author_2, ...]). Then your coloring would be by segment.
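The chunking ideas above can be sketched as small preprocessing functions. This is only an illustration under assumptions: the names by_line, by_book and diff_chunks are hypothetical, the sample book titles are invented, and by_book assumes each book sits on its own line with a prefix such as "1. " or "2. ".

```python
import re
from difflib import SequenceMatcher


def by_line(text):
    # One chunk per line.
    return text.splitlines()


def by_book(text):
    # Assumption: one book per line, prefixed with "1. ", "2. ", ...
    # Stripping the number lets a moved book still match itself.
    return [re.sub(r"^\s*\d+\.\s*", "", line)
            for line in text.splitlines() if line.strip()]


def diff_chunks(old, new, chunker):
    """Diff two strings after splitting both with the given chunker."""
    a, b = chunker(old), chunker(new)
    matcher = SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


old = "1. Book A\n2. Book B\n3. Book C\n"
new = "1. Book B\n2. Book C\n3. Book A\n"
for op in diff_chunks(old, new, by_book):
    print(op)
# ('delete', ['Book A'], [])
# ('equal', ['Book B', 'Book C'], ['Book B', 'Book C'])
# ('insert', [], ['Book A'])
```

With by_line instead of by_book, every line in this sample would count as changed because the numbering shifts, which is why stripping the prefix matters when books are reordered.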