How to correctly diff trees (that is, nested lists of strings)?


Problem description

I'm working on an online editor for a datatype that consists of nested lists of strings. Traffic can become unbearable if I transfer the entire structure every time a single value changes, so to reduce it I've thought of applying a diff tool. The problem is: how do I find and report the diff of two trees? For example:

["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"] ->
["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]

There, the only change is "kat" -> "rag", deep down in the tree. Most diff tools work on flat lists, files, etc., but not on trees. I couldn't find any literature on this specific problem. What is the minimal way to report such a change, and what is an efficient algorithm to find it?

Solution

  1. You can use any general diff algorithm; ready-to-use libraries are easy to find.
  2. If you can use the zlib library, I can suggest another solution. With a small trick, this library can produce a very compact compressed difference between any two binary blocks. Call them A and B, and call the difference Bc.
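For the first option, one possible sketch (not part of the original answer) is to serialize both trees to JSON and run a general sequence diff over the two strings. It assumes Python's standard `difflib` and `json` modules; `make_patch` and `apply_patch` are hypothetical helper names chosen for illustration.

```python
import difflib
import json

def make_patch(old: str, new: str):
    """Return (start, end, text) operations that turn `old` into `new`."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace old[i1:i2] with new[j1:j2]; empty text means a deletion.
            ops.append((i1, i2, new[j1:j2]))
    return ops

def apply_patch(old: str, ops):
    """Rebuild the new text from `old` and the operations from make_patch."""
    out, pos = [], 0
    for i1, i2, text in ops:
        out.append(old[pos:i1])  # unchanged part before this operation
        out.append(text)         # replacement text
        pos = i2
    out.append(old[pos:])        # unchanged tail
    return "".join(out)

old_tree = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["kat", "xe"]], "po", "xi"]
new_tree = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["rag", "xe"]], "po", "xi"]

old_text = json.dumps(old_tree)
new_text = json.dumps(new_tree)
patch = make_patch(old_text, new_text)
restored = json.loads(apply_patch(old_text, patch))
```

Only `patch` would need to travel over the wire. This treats the tree as flat text, so it reports character offsets rather than positions in the tree. The rest of the answer describes the second option.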

Side 1:

  1. Init zlib stream
  2. Compress A->Ac with Z_SYNC_FLUSH (we don't need the result, so Ac can be freed)
  3. Compress B->Bc with Z_SYNC_FLUSH
  4. Deinit zlib stream

We compress block A first, with a special flag that forces zlib to process and output all data. This does not reset the compression state, though! When we then compress block B, the compressor already knows the subsequences of A and will compress block B very efficiently (if the two have a lot in common). Bc is the only data to send.
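The sending side can be sketched with Python's `zlib` module, which exposes the same flush mode as `zlib.Z_SYNC_FLUSH`. This is a minimal illustration, not the answer's own code: `make_delta` is a hypothetical name and the sample blocks are made-up test data.

```python
import random
import zlib

def make_delta(a: bytes, b: bytes) -> bytes:
    """Side 1: compress A, then B, in one stream; return only B's output (Bc)."""
    comp = zlib.compressobj()
    # Ac is thrown away: compressing A only primes the compressor's history.
    comp.compress(a)
    comp.flush(zlib.Z_SYNC_FLUSH)
    # The sync flush emitted all of A's output without resetting the state,
    # so B can be encoded as references to byte runs already seen in A.
    return comp.compress(b) + comp.flush(zlib.Z_SYNC_FLUSH)

# Made-up sample data: 300 pseudo-random words, one of which changes.
rng = random.Random(0)
words = ["".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(6))
         for _ in range(300)]
block_a = ",".join(words).encode()
words[150] = "rag"
block_b = ",".join(words).encode()

bc = make_delta(block_a, block_b)
standalone = zlib.compress(block_b)
# Compare the raw size of B, B compressed alone, and the delta Bc.
print(len(block_b), len(standalone), len(bc))
```

Printing the three sizes gives a rough feel for how much smaller Bc is than B compressed on its own when the blocks share most of their content.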

Side 2:

  1. Init zlib stream
  2. Compress A->Ac with Z_SYNC_FLUSH
  3. Deinit zlib stream

We need to decompress exactly the same blocks that we compressed. That is why we need Ac.

  1. Init zlib stream again
  2. Decompress Ac->A with Z_SYNC_FLUSH
  3. Decompress Bc->B with Z_SYNC_FLUSH
  4. Deinit zlib stream

Now we can decompress Ac->A (we have to, because we did the same on the other side, and it lets the decompressor learn all subsequences of block A) and finally Bc->B.
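A round trip through both sides might look like the following sketch, again using Python's `zlib` module. `make_delta` and `apply_delta` are hypothetical names, and the sketch assumes both sides hold an identical copy of A and use the same default compression settings.

```python
import zlib

def make_delta(a: bytes, b: bytes) -> bytes:
    """Side 1: compress A then B in one stream and keep only B's output (Bc)."""
    comp = zlib.compressobj()
    comp.compress(a)
    comp.flush(zlib.Z_SYNC_FLUSH)
    return comp.compress(b) + comp.flush(zlib.Z_SYNC_FLUSH)

def apply_delta(a: bytes, bc: bytes) -> bytes:
    """Side 2: rebuild B from the local copy of A and the received Bc."""
    # First stream: recreate Ac locally from the receiver's copy of A.
    comp = zlib.compressobj()
    ac = comp.compress(a) + comp.flush(zlib.Z_SYNC_FLUSH)

    # Second stream: feed Ac so the decompressor learns A, then feed Bc.
    decomp = zlib.decompressobj()
    restored_a = decomp.decompress(ac)
    if restored_a != a:
        raise ValueError("local copy of A did not round-trip")
    return decomp.decompress(bc)

block_a = b'["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"]'
block_b = b'["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]'

bc = make_delta(block_a, block_b)
restored_b = apply_delta(block_a, bc)
```

The stream is never finished with `Z_FINISH` on either side, so no end-of-stream check is performed here; a real protocol would probably add its own length or checksum field.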

This is a somewhat unusual and tricky use of zlib, but in this case Bc is not just the compressed block B; it is effectively the compressed difference between blocks A and B. It is very efficient when the size of the zlib dictionary (its sliding window) is comparable with the size of block A. For huge blocks of data it is less efficient.
