Huge XML


Problem description







I have to import a huge XML file into our system. The problem is that
the XML source can be larger than 2GB. I wanted to use the
XmlTextReader class for this purpose, but it's limited to 2GB (which
makes sense to me; why would people make such big XML files?).
The XML file basically contains a big list of products, which I want
to read one by one. The XML itself refers to a DTD. Now I have two
problems. First, validation will of course take ages (it
needs to run later in a server environment), so I need a way to tell the
XmlTextReader that it should ignore the DTD. Second, I don't want to
check that the whole document is well formed (that would also require
processing the whole document); I only need to know that each 'product'
node is well formed. Now my question: which class should I use to
accomplish this? Any ideas?

Thank you,
Sergio

Recommended answer


First, where did you read that XmlTextReader is limited to 2GB? I can't
imagine why this would be so, because XmlTextReader does not read the
entire document in order to check for well-formedness or validity. Your
statement that the reader has to read the entire document to check for
these things is false: so far as I know, XmlTextReader reads the
document line-by-line, or even character-by-character, remembering only
enough state to perform validation. The size of the file should not
matter.

(If that didn't make sense, think of it this way: XmlTextReader reads
the document bit by bit, and only knows that the document is
well-formed and valid _up to the point it's read so far_.)
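
To make the "up to the point it's read so far" idea concrete, here is a minimal sketch in Python rather than .NET. It uses the standard library's `xml.etree.ElementTree.XMLPullParser` as a stand-in for XmlTextReader; the function name `products_seen_before_error` and the sample data are made up for illustration. The parser is fed one chunk at a time and only complains once it actually reaches the malformed part.

```python
import xml.etree.ElementTree as ET


def _drain(parser, seen):
    """Collect the ids of all <product> elements completed so far."""
    for _event, elem in parser.read_events():
        if elem.tag == "product":
            seen.append(elem.get("id"))


def products_seen_before_error(chunks):
    """Feed text chunks to a pull parser.

    Returns (ids of products fully parsed, the ParseError or None).
    """
    parser = ET.XMLPullParser(events=("end",))
    seen = []
    try:
        for chunk in chunks:
            parser.feed(chunk)      # parses only what has been fed so far
            _drain(parser, seen)
    except ET.ParseError as err:
        _drain(parser, seen)        # keep whatever was completed before the error
        return seen, err
    return seen, None


# Hypothetical input: the third product has a mismatched closing tag.
chunks = [
    "<products><product id='1'/>",
    "<product id='2'/>",
    "<product id='3'></oops>",
    "</products>",
]
seen, error = products_seen_before_error(chunks)
```

The first two products are reported normally; the error surfaces only when the chunk containing the bad tag is fed, which is the behaviour described above for a streaming reader.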

I can think of only one situation in which a parser would have to
remember a lot of past history in order to validate, and that is if
you're using keys and references in XML, and the parser wants to
guarantee that the keys are unique and that the references match up.
Since DTDs can't express keys and references, you're safe there.

Besides, you can always instruct the reader not to validate, although
you can't tell it not to check for well-formedness (which definitely
_doesn't_ require having the entire document in hand).

I wrote something I call an XmlFragmentReader on top of XmlTextReader
that does what you want: it reads an XML document in chunks, returning
each chunk (in your case each product) as an XmlDocument (a DOM tree).
I designed it precisely for parsing documents like yours.
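
The XmlFragmentReader itself is not shown in this thread, so the following is not that code. It is only a rough sketch of the same idea in Python, assuming the standard library's `xml.etree.ElementTree.iterparse` as the underlying streaming reader; the name `iter_fragments` and the sample document are hypothetical. Each product comes back as its own small tree, and already-processed products are detached from the root so memory use stays bounded.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(source, tag):
    """Yield each <tag> element as a small standalone tree, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)        # the first event is the root's start
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem                  # the caller works on one product here
            root.clear()                # detach processed children from the root


# Hypothetical sample. The DOCTYPE points at an external DTD; the standard
# library parser is non-validating and does not fetch that file.
sample = (
    b"<?xml version='1.0'?>"
    b"<!DOCTYPE products SYSTEM 'products.dtd'>"
    b"<products>"
    b"<product id='1'><name>Widget</name></product>"
    b"<product id='2'><name>Gadget</name></product>"
    b"</products>"
)
names = [p.findtext("name") for p in iter_fragments(io.BytesIO(sample), "product")]
```

For a real import, `source` would be a file path or an open binary file instead of the in-memory sample. The sketch assumes product elements are not nested inside one another.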

However, I'm still wondering where you read that XmlTextReader can
handle only up to 2GB of XML. I'm always open to being proven wrong. :)


Bruce,

See
http://msdn.microsoft.com/library/de...hXmlReader.asp

After the class table, there is the following paragraph:

Note: The XmlTextReader and XmlValidatingReader are constrained on the
size of files they can read. They cannot read files larger than 2 gigabytes.
If it is possible, split the source file into smaller, multiple files.

Like you, I'm baffled why there is such a limitation. Does anyone know if
this limitation will still be true with 2.0?

Richard Rosenheim
"Bruce Wood" <br*******@canada.com> wrote in message
news:11**********************@o13g2000cwo.googlegroups.com...
> First, where did you read that XmlTextReader is limited to 2GB? [...]




How, exactly, does Microsoft expect one to "split the source files into
smaller, multiple files"?

Using msxsl, which is probably based on XmlTextReader in the first
place? That would be rich: "Please split the file up so that
XmlTextReader can read it. By the way, the only way to split the file
up is with a tool that uses XmlTextReader...."

I would love to know what that limitation is all about, since all
documentation I've read and everything I know about XmlTextReader
indicates that it does _not_ read the entire file at once.

Sergio,

It shouldn't be too hard to write a cheap-n-nasty little application
that just reads the XML as a text stream, recognizes your (unique) XML
format, and breaks the file into manageable chunks (say every n
products) that you could read using XmlTextReader.
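
A cheap-n-nasty splitter along those lines might look like the sketch below, again in Python rather than .NET. Everything specific here is an assumption: the element is called `product`, products are never nested or self-closing, the literal text `</product>` never appears inside comments or CDATA, and each chunk is wrapped in a made-up `<products>` root. The function name `split_products` is hypothetical too.

```python
import io
import re
import xml.etree.ElementTree as ET

OPEN = re.compile(r"<product[\s>]")   # matches <product ...> but not <products>
CLOSE = "</product>"


def split_products(stream, per_chunk, block_size=65536):
    """Yield standalone XML strings, each wrapping up to per_chunk products."""
    buffer, batch = "", []
    while True:
        block = stream.read(block_size)
        buffer += block
        while True:
            end = buffer.find(CLOSE)
            if end == -1:
                break                       # no complete product in the buffer yet
            end += len(CLOSE)
            match = OPEN.search(buffer, 0, end)
            if match:
                batch.append(buffer[match.start():end])
            buffer = buffer[end:]           # drop the consumed text (and any prolog)
            if len(batch) == per_chunk:
                yield "<products>" + "".join(batch) + "</products>"
                batch = []
        if not block:                       # end of stream
            break
    if batch:
        yield "<products>" + "".join(batch) + "</products>"


# Hypothetical sample: five products, split into chunks of two.
sample = "<?xml version='1.0'?><products>" + "".join(
    f"<product id='{i}'><name>P{i}</name></product>" for i in range(5)
) + "</products>"
chunks = list(split_products(io.StringIO(sample), per_chunk=2, block_size=7))
counts = [len(ET.fromstring(chunk).findall("product")) for chunk in chunks]
```

Each chunk is a small document in its own right, so a malformed product only breaks the chunk it sits in rather than the whole import. The tiny `block_size` in the example is there just to exercise tags that straddle block boundaries.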

