Massive number of XML edits


Problem description



I need to load a mid-sized XML file into memory, make many random-access modifications to it (perhaps hundreds of thousands), then write the result out to STDIO. Most of these modifications will be node insertions/deletions, as well as character insertions/deletions within the text nodes. These XML files will be small enough to fit into memory, but large enough that I won't want to keep multiple copies around.

I am trying to settle on the architecture/libraries and am looking for suggestions.

Here is what I have come up with so far:

I am looking for the ideal XML library for this, and so far I haven't found anything that seems to fit the bill. The libraries generally store nodes in Haskell lists, and text in Haskell Data.Text objects. This only allows linear-time node and text inserts, and I believe that a Text insert will have to do a full rewrite of the text on every insert/delete.

I think storing both nodes and text in sequences (Data.Sequence) is the way to go. A sequence supports O(log N) inserts and deletes, and only needs to rewrite a small fraction of the tree on each alteration. None of the XML libs are based on this though, so I will have to either write my own lib, or use one of the other libs to parse and then convert the result to my own form (given how easy it is to parse XML, I would almost just as soon do the former, rather than have a shadow parse of everything).
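To make the idea concrete, here is a minimal sketch of character-level edits on a text node held in a `Data.Sequence`. The names `TextSeq`, `fromString`, `toString`, `insertChar`, `deleteChar` and `insertString` are made up for illustration; only `Seq.insertAt`, `Seq.deleteAt`, `Seq.splitAt` and `Seq.fromList` come from the containers package (`insertAt` and `deleteAt` assume containers 0.5.8 or later).

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Hypothetical representation of one text node: a finger tree of characters.
type TextSeq = Seq Char

fromString :: String -> TextSeq
fromString = Seq.fromList

toString :: TextSeq -> String
toString = toList

-- Insert one character so that it ends up at index i.
-- Cost is logarithmic in the distance to the nearer end of the sequence.
insertChar :: Int -> Char -> TextSeq -> TextSeq
insertChar = Seq.insertAt

-- Delete the character at index i (an out-of-range index leaves the text unchanged).
deleteChar :: Int -> TextSeq -> TextSeq
deleteChar = Seq.deleteAt

-- Insert a whole string at index i by splitting and re-joining.
insertString :: Int -> String -> TextSeq -> TextSeq
insertString i str s =
  let (before, after) = Seq.splitAt i s
  in before <> Seq.fromList str <> after
```

The same functions work unchanged on a `Seq` of child nodes, since nothing here depends on the element type being `Char`.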

I had briefly considered the possibility that this might be a rare case where Haskell is not the best tool. But then I realized that mutability doesn't offer much of an advantage here, because my modifications aren't char replacements, but rather adds/deletes. If I wrote this in C, I would still need to store the strings/nodes in some sort of tree structure to avoid large byte moves for each insert/delete. (Actually, Haskell probably has some of the best tools to deal with this, but I would be open to suggestions of a better choice of language for this task if you feel there is one.)

To summarize:

  1. Is Haskell the right choice for this?

  2. Does any Haskell lib support fast (O(log N)) node/text inserts and deletes?

  3. Is Data.Sequence the best data structure to store a list of items (in my case, nodes and chars) for fast inserts and deletes?

Solution

I will answer my own question:

I chose to wrap a Text.XML tree with a custom object that stores nodes and text in Data.Sequence objects. Because Haskell is lazy, I believe it only temporarily holds the Text.XML data in memory, node by node as the data streams in, and that data is then garbage collected before I actually start any real work modifying the Sequence trees.
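A rough sketch of what such a wrapper could look like, not my actual code. To keep it self-contained it uses a stand-in list-based tree (`ListNode`) instead of the real Text.XML types, plain `String` names, and no attributes or escaping; `ListNode`, `SeqNode`, `toSeqNode`, `insertChild` and `render` are all hypothetical names.

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Stand-in for a parser's list-based tree (NOT the real Text.XML types).
data ListNode
  = ListElem String [ListNode]
  | ListText String

-- Editable form: children and characters both live in sequences.
data SeqNode
  = SeqElem String (Seq SeqNode)
  | SeqText (Seq Char)

-- One pass over the parsed tree; after this the list-based tree can be dropped.
toSeqNode :: ListNode -> SeqNode
toSeqNode (ListElem name kids) = SeqElem name (Seq.fromList (map toSeqNode kids))
toSeqNode (ListText t)         = SeqText (Seq.fromList t)

-- Insert a child at index i of an element; text nodes are returned unchanged.
insertChild :: Int -> SeqNode -> SeqNode -> SeqNode
insertChild i child (SeqElem name kids) = SeqElem name (Seq.insertAt i child kids)
insertChild _ _ other                   = other

-- Naive serializer (no escaping, no attributes), just enough to see the result.
render :: SeqNode -> String
render (SeqText t) = toList t
render (SeqElem name kids) =
  "<" ++ name ++ ">" ++ concatMap render kids ++ "</" ++ name ++ ">"
```

For example, `render (insertChild 1 (toSeqNode (ListText "!")) doc)` puts the new text node between the first and second child of `doc`. Whether the original parsed tree is really collected early depends on nothing else holding a reference to it, so treat the laziness argument above as a belief rather than a guarantee.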

(It would be nice if someone here could verify that this is how Haskell works internally. In any case, I've implemented things and the performance seems to be reasonable, though not great: about 30k inserts/deletes per second, which should do.)
