Parsing XML with invalid nodes

Question

I am parsing a very large XML file. When a node fails to parse, I want to keep looping and process the remaining nodes.

Version 1

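# Assumed import: the question does not show it, but the ParseError message
# below matches the standard library parser.
import xml.etree.ElementTree as etree

for event, element in etree.iterparse(file):
    if element.tag == "tag1":
        pass  # Doing some stuff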

The first version raises an exception:

ParseError: not well-formed (invalid token): line 319851

So, in order to process the remaining nodes, I wrote a second version:

Version 2

xml_parser = etree.iterparse(file)

while True:
    try:
        event, element = next(xml_parser)

        if element.tag == "tag1":
            pass  # Doing some stuff
    # If there are no more elements to iterate over, break the loop
    except StopIteration:
        break
    # On any other exception, keep looping
    except Exception as e:
        pass

In that case the script enters an infinite loop.

So I tried going to the specific line by opening the file as plain text:

with open(file) as fp:
    for i, line in enumerate(fp):
        # Print the lines around the one reported in the error
        if 319850 <= i <= 319853:
            print(i, line)
        if i == 319853:
            break

I get:

319850    <tag1> <tag11><![CDATA[ foo bar

319851    ]]></tag11></tag1>

319852    <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

319853    <tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

So it seems that the line is cut by a "\n". That is an XML error, but why does my second version not work? In my second version, lines 319850 and 319851 are not valid XML, so they should be skipped and the loop should move on to the next nodes/lines.

What am I doing wrong here? If you have a better approach, please let me know.

Update

The XML file contains an invalid character, '\x0b', so it looks like this:

<tag1> <tag11><![CDATA[ foo bar '\x0b']]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>

Recommended answer

I took the lines that seem to be causing trouble and put them into a slightly bigger XML file for trial purposes. Here it is:

<whole>
<tag1>
<tag11>one</tag11>
<tag11><![CDATA[ foo bar
]]></tag11>
<tag11>two</tag11>
<tag11>three</tag11>
</tag1>
<tag1> <tag11><![CDATA[ foo bar
]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1> <tag11><![CDATA[ foo bar]]></tag11></tag1>
<tag1>
<tag11>three</tag11>
<tag11>four</tag11>
<tag11>five</tag11>
<tag11>six</tag11>
</tag1>
</whole>

Then I ran the following code; its output is shown at the end.

>>> import os
>>> os.chdir('c:/scratch')
>>> from lxml import etree
>>> context = etree.iterparse('temp.xml')
>>> for action, elem in context:
...     print (action, elem.tag, elem.sourceline)
...     
end tag11 3
end tag11 4
end tag11 6
end tag11 7
end tag1 2
end tag11 9
end tag1 9
end tag11 11
end tag1 11
end tag11 12
end tag1 12
end tag11 14
end tag11 15
end tag11 16
end tag11 17
end tag1 13
end whole 1

In short, there seems to be nothing wrong with those lines.

You could try printing the line numbers at which tags were found, in order to narrow down the place in the XML that is causing trouble. (This is an edit based on knowledge that I have newly acquired on SO.)
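As a rough sketch of that idea, the snippet below iterates until the parser gives up and then reports the position attached to the error. It uses the standard library's xml.etree.ElementTree rather than lxml (a substitution, so that it runs without extra packages). The function name find_parse_error and the sample bytes are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def find_parse_error(source):
    """Iterate over `source` and report where parsing stops.

    Returns (elements_seen, position). `position` is the (line, column)
    tuple carried by ParseError, or None if the document parsed cleanly.
    """
    elements_seen = 0
    try:
        for _event, element in ET.iterparse(source):
            elements_seen += 1
            element.clear()  # keep memory use low on large files
    except ET.ParseError as err:
        return elements_seen, err.position
    return elements_seen, None


# Hypothetical sample: line 2 carries a vertical tab (\x0b) inside CDATA.
sample = (
    b"<whole>\n"
    b"<tag1><tag11><![CDATA[ foo bar \x0b]]></tag11></tag1>\n"
    b"<tag1><tag11><![CDATA[ foo bar]]></tag11></tag1>\n"
    b"</whole>\n"
)

seen, position = find_parse_error(io.BytesIO(sample))
if position is not None:
    line, column = position
    print(f"parsing stopped at line {line}, column {column} after {seen} elements")
```

For a real file, an open binary file object (or a path) would take the place of the io.BytesIO wrapper.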

I would also suggest using the looping structure suggested in the documentation to avoid the infinite loop. That's what I did in the interactive session above.
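Since the update to the question pins the failure on a '\x0b' character, one possible approach is to strip such characters while the file is being read and keep the plain for loop, with a single try/except around it. Catching every exception and calling next() again can spin forever if the parser keeps raising the same error. The sketch below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, it assumes an ASCII-compatible encoding such as UTF-8, and CleanedStream, tag11_texts and the sample bytes are hypothetical names and data, not part of any library.

```python
import io
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids (tab, LF and CR are allowed).
# Assumes an ASCII-compatible encoding such as UTF-8.
INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class CleanedStream:
    """Hypothetical wrapper that drops forbidden control bytes while reading."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return chunk  # real end of file
            cleaned = INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The chunk held only invalid bytes: read on instead of
            # returning b"", which would look like end of file.


def tag11_texts(raw):
    """Collect the text of every <tag11> found inside each <tag1>."""
    texts = []
    try:
        for _event, element in ET.iterparse(CleanedStream(raw)):
            if element.tag == "tag1":
                texts.extend(child.text for child in element.iter("tag11"))
                element.clear()  # free the processed record
    except ET.ParseError as err:
        # Any remaining error ends the loop once instead of retrying forever.
        print("parsing stopped at (line, column):", err.position)
    return texts


# Hypothetical sample: the first record carries a vertical tab inside CDATA.
sample = (
    b"<whole>"
    b"<tag1><tag11><![CDATA[ foo bar \x0b]]></tag11></tag1>"
    b"<tag1><tag11><![CDATA[ foo bar]]></tag11></tag1>"
    b"</whole>"
)

print(tag11_texts(io.BytesIO(sample)))
```

The wrapper filters each chunk as it is read, so a large file never has to be loaded into memory at once. Note that it silently changes the data, which may or may not be acceptable for the records in question.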
