How to efficiently parse concatenated XML documents from a file

Problem description

I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.

The contents of the concatenated file look like this, so the file as a whole is not itself a well-formed XML document:

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All XML documents conform to the same XML Schema.

Any suggestions or tools? I am working in the Java environment.

Edit: I am not sure whether the XML declaration will be present in the documents.

Edit: Let's assume that the encoding for all the XML documents is UTF-8.

Recommended answer

As Eamon says, if you know the "<?xml ...?>" declaration will always be there, just split the file on that.
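
A minimal Java sketch of that splitting approach is shown here. It assumes the whole file fits in memory as a String (a few hundred documents of 1-4 KB each should), that every document starts with a declaration written as "<?xml" followed by a space, and that this sequence never appears inside a comment or CDATA section. The class and method names (XmlDeclSplitter, split) are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits concatenated XML text at each XML declaration.
final class XmlDeclSplitter {

    // The trailing space avoids matching other processing instructions
    // such as "<?xml-stylesheet". Assumption: the declaration uses a space here.
    private static final String DECL = "<?xml ";

    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int start = text.indexOf(DECL);
        while (start >= 0) {
            // The next declaration marks the end of the current document.
            int next = text.indexOf(DECL, start + DECL.length());
            String doc = (next < 0 ? text.substring(start)
                                   : text.substring(start, next)).trim();
            if (!doc.isEmpty()) {
                docs.add(doc);
            }
            start = next;
        }
        return docs;
    }

    public static void main(String[] args) {
        String input =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>1</someData>\n"
          + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>2</someData>\n";
        for (String doc : split(input)) {
            System.out.println("--- document ---");
            System.out.println(doc);
        }
    }
}
```

Each returned string can then be handed to a regular parser, for example a javax.xml.parsers.DocumentBuilder reading from a StringReader wrapped in an InputSource.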

Failing that, look for the closing document-level tag. That is, scan the text while counting how many levels deep you are. Every time you see a tag that begins with "<" but not "</", and that does not end with "/>", add 1 to the depth count. Every time you see a tag that begins with "</", subtract 1. Every time you subtract 1, check whether you are now at zero. If so, you have reached the end of an XML document. Note that constructs which begin with "<?" or "<!" (the XML declaration, processing instructions, comments, CDATA sections, DOCTYPE) are not element tags and should be skipped without changing the count, and that a root element written as a single self-closing tag is a complete document on its own.
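
The depth-counting scan can be sketched in Java roughly as follows. This is an illustrative sketch rather than a full XML tokenizer: it assumes the input is well-formed apart from being concatenated, that it is already decoded into a String, and that any DOCTYPE has no internal subset. The names XmlDepthSplitter, split, tagEnd and endOf are hypothetical. Comments or processing instructions that trail a root element end up attached to the start of the following document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits concatenated XML by tracking element depth.
final class XmlDepthSplitter {

    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int depth = 0;
        int docStart = -1; // start of the current document, -1 when between documents
        int i = 0;
        while (i < text.length()) {
            int lt = text.indexOf('<', i);
            if (lt < 0) break;
            if (docStart < 0) docStart = lt;
            if (text.startsWith("<!--", lt)) {
                i = endOf(text, "-->", lt + 4);          // comment
            } else if (text.startsWith("<![CDATA[", lt)) {
                i = endOf(text, "]]>", lt + 9);          // CDATA section
            } else if (text.startsWith("<?", lt)) {
                i = endOf(text, "?>", lt + 2);           // declaration or PI
            } else if (text.startsWith("<!", lt)) {
                i = endOf(text, ">", lt + 2);            // DOCTYPE (no internal subset assumed)
            } else {
                int gt = tagEnd(text, lt + 1);
                boolean closing = text.startsWith("</", lt);
                boolean selfClosing = !closing && text.charAt(gt - 1) == '/';
                if (closing) depth--;
                else if (!selfClosing) depth++;
                i = gt + 1;
                // Depth zero after a closing or self-closing tag: root element is complete.
                if (depth == 0) {
                    docs.add(text.substring(docStart, i));
                    docStart = -1;
                }
            }
        }
        if (depth != 0) throw new IllegalArgumentException("Unbalanced tags in input");
        return docs;
    }

    // Finds the '>' that ends a tag, ignoring any '>' inside quoted attribute values.
    private static int tagEnd(String text, int from) {
        char quote = 0;
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i;
            }
        }
        throw new IllegalArgumentException("Unterminated tag at offset " + (from - 1));
    }

    // Returns the index just past the given terminator.
    private static int endOf(String text, String terminator, int from) {
        int idx = text.indexOf(terminator, from);
        if (idx < 0) throw new IllegalArgumentException("Missing " + terminator);
        return idx + terminator.length();
    }

    public static void main(String[] args) {
        String input = "<someData><a/><b>x</b></someData>\n<someData>y</someData>";
        split(input).forEach(System.out::println);
    }
}
```

Because the scan never relies on the XML declaration, it works whether or not the declaration is present, which matters given the question's first edit.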
