iterparse fails to parse a field, while other similar ones are fine
Problem description
I use Python's iterparse to parse the XML result of a nessus scan (.nessus file). The parsing fails on unexpected records, while similar ones have been parsed correctly.
The general structure of the XML file is a lot of records like the one below:
<ReportHost>
  <ReportItem>
    <foo>9.3</foo>
    <bar>hello</bar>
  </ReportItem>
  <ReportItem>
    <foo>10.0</foo>
    <bar>world</bar>
  </ReportItem>
</ReportHost>
<ReportHost>
  ...
</ReportHost>
In other words, there are many hosts (ReportHost), each with many items to report (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, with its characteristics.
The parsing fails in the middle of the file at a simple line (foo in that case being cvss_base_score)
<cvss_base_score>9.3</cvss_base_score>
while ~200 similar lines have been parsed without problems.
The relevant piece of code is below. It sets context markers (inReportHost and inReportItem) which tell me where in the structure of the XML file I am, and either assigns or prints a value, depending on the context:
import xml.etree.cElementTree as ET

inReportHost = False
inReportItem = False
for event, elem in ET.iterparse("test2.nessus", events=("start", "end")):
    if event == 'start' and elem.tag == "ReportHost":
        inReportHost = True
    if event == 'end' and elem.tag == "ReportHost":
        inReportHost = False
        elem.clear()
    if inReportHost:
        if event == 'start' and elem.tag == 'ReportItem':
            inReportItem = True
            cvss = ''
        if event == 'start' and inReportItem:
            if event == 'start' and elem.tag == 'cvss_base_score':
                cvss = elem.text
        if event == 'end' and elem.tag == 'ReportItem':
            print cvss
            inReportItem = False
cvss sometimes has the None value (after the cvss = elem.text assignment), even though identical entries have been parsed properly earlier in the file.
If I add, below the assignment, something along the lines of
if cvss is None: cvss = "0"
then many of the subsequent cvss values are assigned properly (and some others are still None).
When I take the <ReportHost>...</ReportHost> block that is parsed incorrectly and run it through the program on its own, it works fine (i.e. cvss is assigned 9.3 as expected).
I am lost as to where the mistake in my code is, since within a large set of similar records some are processed correctly and some are not (some of the records are identical, and are still processed differently). I also cannot find anything particular about the records that fail: identical ones earlier and later are fine.
Recommended answer
From the iterparse() docs:
Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.
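The effect described in the note can be reproduced on purpose with XMLPullParser, which lets the caller decide where the input is split (iterparse reads the file in chunks, so in a large file the split can land anywhere). This is only a sketch: the two-chunk split below is an assumption, chosen so that the first chunk ends right after the start tag, which is presumably what happens to the failing records.

```python
import xml.etree.ElementTree as ET

# Hypothetical split: the first chunk ends right after the start tag,
# so the text "9.3" has not been fed to the parser yet.
chunks = ["<ReportItem><cvss_base_score>", "9.3</cvss_base_score></ReportItem>"]

parser = ET.XMLPullParser(events=("start", "end"))
seen = []  # (event, tag, text) as observed at the moment each event is read

for chunk in chunks:
    parser.feed(chunk)
    for event, elem in parser.read_events():
        seen.append((event, elem.tag, elem.text))
parser.close()

for row in seen:
    print(row)
```

With this split, the "start" event for cvss_base_score should be reported while its text is still None, and the text should only be available by the time of the "end" event. When the whole element happens to sit inside a single chunk, the text is already there at "start", which would explain why most records looked fine.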
Drop the inReport* variables and process each ReportHost only on its "end" event, when it has been fully parsed. Use the ElementTree API to get the necessary info, such as cvss_base_score, from the current ReportHost element.
To preserve memory, do:
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(filename_or_file, tag):
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # preserve memory

for host in getelements("test2.nessus", "ReportHost"):
    for cvss_el in host.iter("cvss_base_score"):
        print(cvss_el.text)
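The question's stated goal, one line per ReportItem with its characteristics, could be sketched in the same spirit. Everything below is illustrative: the sample data, the Report root element, and the item_rows helper are made up for the example, and the field names simply follow the question. Children are read only once the ReportItem's "end" event has been seen.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample data shaped like the structure in the question;
# the <Report> root is an assumption needed for well-formed XML.
SAMPLE = b"""<Report>
<ReportHost>
<ReportItem><cvss_base_score>9.3</cvss_base_score><bar>hello</bar></ReportItem>
<ReportItem><bar>world</bar></ReportItem>
</ReportHost>
<ReportHost>
<ReportItem><cvss_base_score>10.0</cvss_base_score><bar>again</bar></ReportItem>
</ReportHost>
</Report>"""

def item_rows(source, fields=("cvss_base_score", "bar")):
    """Yield one tuple per ReportItem, read only on 'end' events."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReportItem":
            # findtext returns the default when the child is missing.
            yield tuple(elem.findtext(f, default="") for f in fields)
        elif elem.tag == "ReportHost":
            elem.clear()  # this host's items have been reported; free them

rows = list(item_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print("\t".join(row))
```

An item without a cvss_base_score yields an empty string here instead of None. Cleared hosts stay attached to the root as empty elements; the root.clear() approach in getelements above removes those as well, which matters for very large files.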