How to push diffs of data (possibly JSON) to a server?


Question

I am going to be periodically pushing a set of text-based data from a web page to a server, probably as JSON.

For every push, none, some or all of the data may have changed. To reduce the amount of data I have to send over the wire, I would like to send only a diff of the changes in each push.

Do you know of any pre-made solutions / tools / libraries that:


  • Dynamically build a diff of JSON as changes are made to it (to avoid storing oldJson and newJson and doing a full diff on every push), written in JavaScript (i.e. for the client side)
  • Patch an existing chunk of JSON with a JSON diff on the server side, written for any platform that isn't Java or .NET^ (it needs to run on Linux; Java is not an option in my environment, and neither is Mono).

Moreover, is this even the best way of going about this particular problem? Is there a better way to push chunks of text data around?

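Edit: some clarifications:

  • The likely data structure is basically a mostly flat collection of nodes (flat in the sense that it is highly connected, so any links are ID-based references rather than actual nested data). Nodes contain collections of trees, and the leaves of those trees contain the actual primary data, such as numbers, strings and IDs. Most changes to the data will be in the leaves.

    • Most leaf data will be very small (primitives, or less than a paragraph of text), but some will be very long (pages of rich text).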

^ That is, whatever you would expect to find on random shared hosting. I'm talking about your good friends PHP, Python, Perl, Ruby and the like. Or something that could easily be installed on random shared hosting.

Recommended answer

This is something I've been struggling with as well. I'll be keenly interested if anyone else offers a better answer than mine, but for the time being...

First off: http://www.xn--schler-dya.net/blog/2008/01/15/diffing_json_objects/

I personally have not been able to get this library to work, but your mileage may vary.

The other alternative is to not solve the problem with a diff algorithm at all. Diffing is quite inefficient, and depending on the problem you may get better performance by just sending the whole blob of data, even if you end up repeating yourself. This is mainly true of very small chunks of data. Obviously there will be a turning point as the data you need to transmit gets larger, but without some kind of measurement it won't be obvious where that point is. The catch is that the bigger your data gets, the longer your diff calculation takes too. The turning point is determined only by the intersection of the two curves formed by each method's rate of growth, both of which will be linear or worse, depending on how your diff is implemented. In the worst case, you may see an island in the middle where diff performs better, but then the curves cross back again for even larger data sets, and plain sending over the network wins again.

The next stop before trying diff is to wrap your data access in "get", "set" and "delete" methods that track the changes being made. The data you send over the wire is then essentially a sequential log of these methods' usage, which you flush from the client side on each successful transmission. On the server side you apply this log to your server-side data, using server-side analogues of your data access methods. This is a somewhat lighter solution than a diff, since it doesn't require quite as much processing power.
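As a rough illustration of the change-log idea, here is a minimal sketch in Python (the real client would be JavaScript, and the server could be any of the languages mentioned in the question). The names `TrackedStore`, `pending`, `mark_sent` and `apply_log`, and the `{"op": ..., "key": ...}` log format, are all hypothetical choices for this example, not part of any existing library.

```python
import json
from typing import Any


class TrackedStore:
    """Hypothetical client-side wrapper: every write is appended to a log."""

    def __init__(self, data=None):
        self._data = dict(data or {})
        self._log = []  # sequential record of set/delete calls

    def get(self, key: str) -> Any:
        return self._data[key]

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._log.append({"op": "set", "key": key, "value": value})

    def delete(self, key: str) -> None:
        del self._data[key]
        self._log.append({"op": "delete", "key": key})

    def pending(self):
        """Return (JSON payload, number of entries) without clearing the log."""
        return json.dumps(self._log), len(self._log)

    def mark_sent(self, count: int) -> None:
        """Drop the first `count` entries once the server has confirmed them."""
        del self._log[:count]


def apply_log(server_data: dict, payload: str) -> None:
    """Server-side analogue: replay the logged operations in order."""
    for entry in json.loads(payload):
        if entry["op"] == "set":
            server_data[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            server_data.pop(entry["key"], None)
        else:
            raise ValueError("unknown operation: %r" % entry["op"])


# Usage: client and server start from the same state.
client = TrackedStore({"title": "Draft", "body": "Hello"})
server = {"title": "Draft", "body": "Hello"}

client.set("title", "Final")
client.delete("body")

payload, count = client.pending()
apply_log(server, payload)   # stands in for the network round trip
client.mark_sent(count)      # flush only after a successful transmission
```

Clearing only the confirmed entries, rather than the whole log, is a design choice in this sketch: changes made while a push is in flight stay queued for the next one.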

Finally, if you're going to do a diff, the most efficient way I can think of is to break your dataset down into discrete "chunks", each with a unique ID. Then when you run the diff, its coarseness is exactly at the "chunk" level; that is, the only comparisons you make are ID to ID. If you change a chunk, give it a new ID. The coarser you can afford to make the diff algorithm, the less time it will take to run.

Alternatively, rather than assigning a new ID on change, you could simply run the diff to check whether a specific object has "changed", stop short as soon as you detect a change, and mark that chunk to be re-sent in its entirety, updating the chunk with the same ID on the server side. This could be made even more efficient if you have some kind of quick hashing algorithm for your chunks that you can use to establish equality quickly.
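The hashing variant could look roughly like the following Python sketch. It assumes each chunk is JSON-serializable and that the client remembers the digest of every chunk as last sent; `chunk_digest` and `changed_chunks` are hypothetical names, and SHA-256 is used only because it ships with the standard library (any fast, stable hash would do).

```python
import hashlib
import json


def chunk_digest(chunk) -> str:
    """Hash a canonical JSON form so equal chunks always hash equally."""
    canonical = json.dumps(chunk, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_chunks(chunks: dict, sent_digests: dict) -> dict:
    """Return {chunk_id: chunk} for chunks that differ from what was last sent."""
    return {
        chunk_id: chunk
        for chunk_id, chunk in chunks.items()
        if sent_digests.get(chunk_id) != chunk_digest(chunk)
    }


# Usage: record digests after a push, edit one chunk, find what to resend.
chunks = {"n1": {"text": "hello"}, "n2": {"count": 1}}
sent = {chunk_id: chunk_digest(chunk) for chunk_id, chunk in chunks.items()}

chunks["n2"]["count"] = 2
to_send = changed_chunks(chunks, sent)  # whole chunks, keyed by unchanged ID
```

Note that hashing reads each chunk in full, so it trades the "stop short at the first difference" shortcut for not having to keep a copy of the old data around.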

If the sequence of your chunks doesn't matter, or if you can store the sequence as a property of the chunks themselves rather than modelling it by the physical sequence of the chunks, then you can even key your chunks by ID. Discovering the differences is then simply a matter of listing the keys of object A and looking them up in object B, and then vice versa. This is much simpler to implement than a "real" diff algorithm, and it has O(a+b) performance, which (I think) is better than the worst case for a real diff algorithm, a case you're likely to hit if you implement it yourself or get a bad implementation.
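A minimal Python sketch of this keyed comparison follows. The function names and the `{"set": ..., "delete": ...}` diff format are assumptions made for the example. The O(a+b) figure counts key lookups only; it holds overall if comparing two chunks is cheap, for example because changed chunks get new IDs or are compared by hash.

```python
def keyed_diff(old: dict, new: dict) -> dict:
    """Compare two ID-keyed chunk collections in one pass over each."""
    diff = {"set": {}, "delete": []}
    # Pass over B: anything new or different must be (re)sent.
    for chunk_id, chunk in new.items():
        if chunk_id not in old or old[chunk_id] != chunk:
            diff["set"][chunk_id] = chunk
    # Pass over A: anything missing from B was deleted.
    for chunk_id in old:
        if chunk_id not in new:
            diff["delete"].append(chunk_id)
    return diff


def apply_keyed_diff(target: dict, diff: dict) -> None:
    """Server-side patch: upsert the sent chunks, then drop the deleted IDs."""
    target.update(diff["set"])
    for chunk_id in diff["delete"]:
        target.pop(chunk_id, None)


# Usage
old = {"a": 1, "b": 2, "c": 3}
new = {"a": 1, "b": 20, "d": 4}

diff = keyed_diff(old, new)
server = dict(old)
apply_keyed_diff(server, diff)
```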
