iterparse fails to parse a field, while other similar ones are fine

Views: 122 | Published: 2020/7/23 19:08:58 | Tags: python xml xml-parsing iterparse


Problem description

I use Python's iterparse to parse the XML result of a nessus scan (.nessus file). The parsing fails on unexpected records, while similar ones have been parsed correctly.

The general structure of the XML file is a large number of records like the one below:

<ReportHost>
  <ReportItem>
    <foo>9.3</foo>
    <bar>hello</bar>
  </ReportItem>
  <ReportItem>
    <foo>10.0</foo>
    <bar>world</bar>
  </ReportItem>
</ReportHost>
<ReportHost>
   ...
</ReportHost>

In other words, there are many hosts (ReportHost), each with many items to report (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, with its characteristics.

The parsing fails in the middle of the file at a simple line (foo in that case being cvss_base_score):

<cvss_base_score>9.3</cvss_base_score>

while about 200 similar lines have been parsed without problems.

The relevant piece of code is below. It sets context markers (inReportHost and inReportItem) which tell me where in the structure of the XML file I am, and it either assigns or prints a value, depending on the context.

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
inReportHost = False
inReportItem = False

for event, elem in ET.iterparse("test2.nessus", events=("start", "end")):
    if event == 'start' and elem.tag == "ReportHost":
        inReportHost = True
    if event == 'end' and elem.tag == "ReportHost":
        inReportHost = False
        elem.clear()
    if inReportHost:
        if event == 'start' and elem.tag == 'ReportItem':
            inReportItem = True
            cvss = ''
        if event == 'start' and inReportItem:
            if event == 'start' and elem.tag == 'cvss_base_score':
                cvss = elem.text
        if event == 'end' and elem.tag == 'ReportItem':
            print(cvss)
            inReportItem = False

cvss sometimes has the value None (after the cvss = elem.text assignment), even though identical entries have been parsed properly earlier in the file.

If I add something along these lines below the assignment:

if cvss is None: cvss = "0"

then many of the subsequent cvss values are assigned correctly (and some others are still None).

When I take the <ReportHost>...</ReportHost> block that is parsed incorrectly and run it through the program on its own, it works fine (i.e. cvss is assigned 9.3 as expected).

I am lost as to where the mistake in my code is: within a large set of similar records, some are processed correctly and some are not (some of the records are identical and are still processed differently). I also cannot find anything particular about the records that fail, since identical ones earlier and later in the file are fine.

Recommended answer

From the iterparse() documentation:

Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.
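This also suggests why the failure looks random. As far as I can tell, iterparse feeds the file to the parser in chunks, so elem.text is missing at "start" only when a chunk happens to end between a start tag and its text. The sketch below reproduces that situation on purpose with XMLPullParser, splitting the input by hand; the two-chunk split is an illustrative assumption, not what iterparse does with any particular file.

```python
import xml.etree.ElementTree as ET

# XMLPullParser lets us control exactly how much input the parser has seen.
parser = ET.XMLPullParser(events=("start", "end"))

# First chunk ends right after the ">" of the start tag.
parser.feed("<ReportItem><cvss_base_score>")
start_events = list(parser.read_events())
score = start_events[-1][1]           # element from the last "start" event
text_at_start = score.text            # the text has not been parsed yet

# Second chunk carries the text and the closing tags.
parser.feed("9.3</cvss_base_score></ReportItem>")
end_events = [(ev, el) for ev, el in parser.read_events() if ev == "end"]
text_at_end = end_events[0][1].text   # same element, now fully populated

print(repr(text_at_start), repr(text_at_end))
```

Reading the value on the "end" event does not depend on where the chunk boundaries fall.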

Drop the inReport* variables and process each ReportHost only on its "end" event, when it is fully parsed. Use the ElementTree API to get the necessary information, such as cvss_base_score, from the current ReportHost element.

To preserve memory, do:

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(filename_or_file, tag):
    """Yield each fully parsed element with the given tag, then free it."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # preserve memory

for host in getelements("test2.nessus", "ReportHost"):
    for cvss_el in host.iter("cvss_base_score"):
        print(cvss_el.text)
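Since the question asks for one line per ReportItem, the same idea can be applied at the item level. The following is only a sketch: item_rows, the SAMPLE document, the <Report> wrapper and the "0" default for a missing score are all made up for illustration and are not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory sample; the second item has no cvss_base_score.
SAMPLE = b"""<Report>
<ReportHost>
  <ReportItem><cvss_base_score>9.3</cvss_base_score><bar>hello</bar></ReportItem>
  <ReportItem><bar>world</bar></ReportItem>
</ReportHost>
</Report>"""

def item_rows(source, fields=("cvss_base_score", "bar"), default="0"):
    """Yield one tuple per ReportItem, read only on its 'end' event."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReportItem":
            # findtext returns `default` when the child element is absent.
            yield tuple(elem.findtext(name, default) for name in fields)
            elem.clear()  # free the item's children once it is reported

rows = list(item_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print("\t".join(row))
```

Because only "end" events are requested, every ReportItem is complete when its fields are read, so no inReport* flags are needed.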
