Parse several XML declarations in a single file by means of lxml.etree.iterparse


Problem description

I need to parse a file that contains several XML documents one after another, i.e., <xml></xml> <xml></xml> ... and so forth. When I use etree.iterparse, I get the following (correct) error:

lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document

Now, I could preprocess the input file and write each contained XML document to its own file. That might be the easiest solution, but I wonder whether a proper solution to this 'problem' exists.

Thanks!

Recommended answer

The sample data you've provided suggests one problem, while the question and the exception you've provided suggest another. Do you have multiple XML documents concatenated together, each with its own XML declaration, or do you have an XML fragment with multiple top-level elements?

If it's the former, then the solution's going to involve breaking the input stream up into multiple streams, and parsing each one individually. This doesn't necessarily mean, as one comment suggests, implementing an XML parser. You can search a string for XML declarations without having to parse anything else in it, so long as your input doesn't include CDATA sections that contain unescaped XML declarations. You can write a file-like object that returns characters from the underlying stream until it hits an XML declaration, and then wrap it in a generator function that keeps returning streams until EOF is reached. It's not trivial, but it's not hugely difficult either.
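
As a rough illustration of the splitting idea, here is a minimal sketch that reads the whole input into memory rather than building the streaming file-like object described above. The names `split_documents` and `iterparse_all` are hypothetical, the fallback to the standard library's `xml.etree.ElementTree` is an assumption made only so the sketch runs without lxml installed, and it relies on the stated condition that no CDATA section contains a literal XML declaration.

```python
import io
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API is close enough for this sketch.
    import xml.etree.ElementTree as etree

# Start of an XML declaration. The trailing \s keeps processing
# instructions such as <?xml-stylesheet ...?> from matching.
XML_DECL = re.compile(rb"<\?xml\s")


def split_documents(data: bytes):
    """Yield each concatenated XML document as its own bytes object."""
    starts = [m.start() for m in XML_DECL.finditer(data)]
    if not starts or data[:starts[0]].strip():
        # Content before the first declaration (or no declaration at all)
        # is treated as a document that starts at offset 0.
        starts.insert(0, 0)
    for begin, end in zip(starts, starts[1:] + [len(data)]):
        chunk = data[begin:end].strip()
        if chunk:
            yield chunk


def iterparse_all(data: bytes):
    """Run iterparse over every document and yield (tag, text) pairs."""
    for doc in split_documents(data):
        for _event, elem in etree.iterparse(io.BytesIO(doc), events=("end",)):
            yield elem.tag, elem.text
```

For input that is too large to hold in memory, the same regular expression could be applied to buffered chunks of the stream instead, taking care of declarations that straddle a chunk boundary.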

If you have an XML fragment with multiple top-level elements, you can just wrap them in an XML element and parse the whole thing.
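
A minimal sketch of the wrapping approach, assuming the fragment fits in memory and contains no XML declaration of its own. The function name `parse_fragment` and the wrapper element name `root` are hypothetical, and the fallback import is an assumption made only so the sketch runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib API is close enough for this sketch.
    import xml.etree.ElementTree as etree


def parse_fragment(fragment: bytes):
    """Wrap a multi-root fragment in one element and return its children."""
    # "root" is an arbitrary wrapper name chosen for illustration.
    wrapped = b"<root>" + fragment + b"</root>"
    root = etree.fromstring(wrapped)
    # The original top-level elements are now the wrapper's direct children.
    return list(root)
```

This fails with a syntax error if the fragment contains an XML declaration anywhere, which is exactly the concatenated-documents case handled by splitting instead.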

Of course, as with most problems involving bad XML input, the easiest solution may just be to fix the thing that's producing the bad input.
