Match changes by words, not by characters

Question

I'm using difflib's SequenceMatcher and its get_opcodes() method, and then highlighting the changes with CSS to create a kind of web diff.

First, I set a min_delta so that two strings count as different only if 3 or more characters differ in total across the whole string (delta is the real, encountered delta, which sums up all the one-character changes):

from difflib import SequenceMatcher

# The first positional argument is isjunk, so pass None before the two strings.
matcher = SequenceMatcher(None, source_str, diff_str)
min_delta = 3
delta = 0

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        continue  # nothing to capture here
    elif tag == "delete":
        if source_str[i1:i2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # delete i2-i1 chars
    elif tag == "replace":
        if source_str[i1:i2].isspace() or diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # replace i2-i1 chars
    elif tag == "insert":
        if diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (j2 - j1)  # insert j2-j1 chars

# The strings count as different once 3 or more characters differ.
return_value = delta >= min_delta

This helps me determine whether two strings really differ. It is not very efficient, but I couldn't come up with anything better.

Then I colorize the differences between the two strings in the same way:

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        pass  # bustling with strings, inserting them in <span>s and colorizing
    elif tag == "delete":
        pass  # ...

return_value = old_string, new_string

The result looks pretty ugly (blue for replaced, green for new, red for deleted, and no highlighting for equal):

This happens because SequenceMatcher matches every single character. I want it to match every single word instead (and probably the whitespace around each word), or to do something even nicer looking, because, as you can see in the screenshot, the first book was actually moved to the fourth position.

It seems to me that something could be done with the isjunk and autojunk parameters of SequenceMatcher, but I can't figure out how to write lambdas for my purposes.

So I have two questions:


  1. Is it possible to match by words? Can it be done with get_opcodes() and SequenceMatcher? If not, what could be used instead?

  2. This is rather a corollary, but nevertheless: if matching by words is possible, then I can get rid of the dirty min_delta hack and return True as soon as at least one word differs, right?


Recommended answer

SequenceMatcher can accept lists of str as input.

You can first split the input into words and then let SequenceMatcher diff the words. Your colored diff will then be by words instead of by characters.

>>> from difflib import SequenceMatcher
>>> def my_get_opcodes(a, b):
...     # Print each opcode with the slices of a and b it refers to.
...     s = SequenceMatcher(None, a, b)
...     for tag, i1, i2, j1, j2 in s.get_opcodes():
...         print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...             tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
...

>>> my_get_opcodes("qabxcd", "abycdf")
delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'

# This is the bad result you currently have.
>>> my_get_opcodes("one two three\n", "ore tree emu\n")
equal     a[0:1] --> b[0:1]      'o' --> 'o'
replace   a[1:2] --> b[1:2]      'n' --> 'r'
equal     a[2:5] --> b[2:5]    'e t' --> 'e t'
delete    a[5:10] --> b[5:5]  'wo th' --> ''
equal     a[10:13] --> b[5:8]    'ree' --> 'ree'
insert    a[13:13] --> b[8:12]       '' --> ' emu'
equal     a[13:14] --> b[12:13]     '\n' --> '\n'

>>> my_get_opcodes("one two three\n".split(), "ore tree emu\n".split())
replace   a[0:3] --> b[0:3] ['one', 'two', 'three'] --> ['ore', 'tree', 'emu']

# This may be the result you want.
>>> my_get_opcodes("one two emily three ha\n".split(), "ore tree emily emu haha\n".split())
replace   a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal     a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace   a[3:5] --> b[3:5] ['three', 'ha'] --> ['emu', 'haha']

# A more complicated example exhibiting all four kinds of opcodes.
>>> my_get_opcodes("one two emily three yo right end\n".split(), "ore tree emily emu haha yo yes right\n".split())
replace   a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal     a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace   a[3:4] --> b[3:5] ['three'] --> ['emu', 'haha']
equal     a[4:5] --> b[5:6]   ['yo'] --> ['yo']
insert    a[5:5] --> b[6:7]       [] --> ['yes']
equal     a[5:6] --> b[7:8] ['right'] --> ['right']
delete    a[6:7] --> b[8:8]  ['end'] --> []
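
Building on the word-level opcodes above, the following is a minimal sketch of how the colored output might be produced. It is not the asker's actual code: the function names (tokenize, wrap, highlight_words, words_differ) and the CSS class names are hypothetical. It assumes that whitespace runs should be kept as separate tokens so the original spacing survives, and it passes autojunk=False so that frequent tokens such as a single space are not treated as junk in longer inputs.

```python
import re
from difflib import SequenceMatcher
from html import escape

# Hypothetical CSS class names; the original post does not name any.
CSS_CLASS = {
    "replace": "diff-replace",
    "delete": "diff-delete",
    "insert": "diff-insert",
}


def tokenize(text):
    """Split text into alternating runs of whitespace and non-whitespace."""
    return re.findall(r"\s+|\S+", text)


def wrap(tokens, tag):
    """Join tokens and wrap them in a <span>, unless they are equal or empty."""
    chunk = escape("".join(tokens))
    if tag == "equal" or not chunk:
        return chunk
    return '<span class="{}">{}</span>'.format(CSS_CLASS[tag], chunk)


def highlight_words(old, new):
    """Return (old_html, new_html) with word-level changes wrapped in spans."""
    a, b = tokenize(old), tokenize(new)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    old_parts, new_parts = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_parts.append(wrap(a[i1:i2], tag))
        new_parts.append(wrap(b[j1:j2], tag))
    return "".join(old_parts), "".join(new_parts)


def words_differ(old, new):
    """Whitespace-agnostic check: True as soon as any word differs."""
    return old.split() != new.split()


old_html, new_html = highlight_words("one two three", "one tree three")
print(old_html)
print(new_html)
```

On the corollary in the question: once the comparison is done on words, a plain comparison of the two split() lists (as in words_differ) is enough to tell whether anything other than whitespace changed, so a character-count threshold like min_delta is probably no longer needed.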

You can also diff by line, by book, or by segment. You only need a function that preprocesses the whole passage string into the desired diff chunks.

For example:


  • To diff by line: you could probably use splitlines().
  • To diff by book: you could probably implement a function that strips off the leading 1., 2., and so on.
  • To diff by segment: you could pass the API two lists like ([book_1, author_1, year_1, book_2, author_2, ...], [book_1, author_1, year_1, book_2, author_2, ...]). Your coloring would then be by segment.
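
The first two options in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (by_line, by_book, changed_chunks) are hypothetical, the input is assumed to hold one book per line numbered like "1. Title", and the sample titles are made up for the example.

```python
import re
from difflib import SequenceMatcher


def by_line(text):
    """One chunk per line, using str.splitlines()."""
    return text.splitlines()


def by_book(text):
    """One chunk per non-empty line, with a leading '1. ', '2. ' removed.

    Assumes one book per line, numbered like '1. Title'.
    """
    return [
        re.sub(r"^\s*\d+\.\s*", "", line)
        for line in text.splitlines()
        if line.strip()
    ]


def changed_chunks(a_chunks, b_chunks):
    """Return only the non-equal opcodes between two lists of chunks."""
    matcher = SequenceMatcher(None, a_chunks, b_chunks, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Made-up sample data: the first book moves to the last position.
old = "1. Dune\n2. Emma\n3. Ulysses\n"
new = "1. Emma\n2. Ulysses\n3. Dune\n"

print(changed_chunks(by_line(old), by_line(new)))
print(changed_chunks(by_book(old), by_book(new)))
```

With the numbering left in, every line differs, so the line-level diff reports one large replace. After the numbers are stripped, the moved book shows up as one delete and one insert around an unchanged block, which is closer to what the screenshot in the question calls for.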
