XML reader seems to ignore tag hierarchy
Problem description
In an XML file, I'm trying to get the content of a tag that appears multiple times at different levels of the tag hierarchy. I want the content of the highest-level occurrence of the tag, but my XML reader (BeautifulSoup for Python) keeps giving me the wrong content.
Here is the concrete problem. This is part of the XML file (condensed to the parts I believe are relevant):
<object>
<name>person</name>
<part>
<name>head</name>
<bndbox>
<xmin>337</xmin>
<ymin>2</ymin>
<xmax>382</xmax>
<ymax>66</ymax>
</bndbox>
</part>
<bndbox>
<xmin>334</xmin>
<ymin>1</ymin>
<xmax>436</xmax>
<ymax>373</ymax>
</bndbox>
</object>
I'm interested in getting the content of the <bndbox> tag at the very end of this snippet via the command
box = object.bndbox
But if I print out box, I keep getting this:
<bndbox>
<xmin>337</xmin>
<ymin>2</ymin>
<xmax>382</xmax>
<ymax>66</ymax>
</bndbox>
This makes no sense to me. The box I keep getting sits one hierarchy level lower than what I'm asking for, under a <part> tag, so I should only be able to access it via
object.part.bndbox
whereas
object.bndbox
should give me the only box that sits directly under the object tag in the hierarchy, which is the last box in the snippet above.
Recommended answer
As stated by @mjsqu in the comments (https://stackoverflow.com/questions/49971317/xml-reader-seems-to-ignore-tag-hierarchy#comment86958670_49971317):
BeautifulSoup returns the first tag matching that name, so object.bndbox refers to the first bndbox in the XML, regardless of its position in the hierarchy.
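The behaviour described in the comment can be checked directly. The following is a minimal sketch, not part of the original answer: it uses a shortened copy of the XML and the standard-library html.parser (an assumption made only so the example runs without lxml) to show that attribute access and find() return the same, nested tag.

```python
from bs4 import BeautifulSoup

# Shortened copy of the XML from the question (only xmin kept per box).
xml = (
    "<object><name>person</name>"
    "<part><name>head</name><bndbox><xmin>337</xmin></bndbox></part>"
    "<bndbox><xmin>334</xmin></bndbox>"
    "</object>"
)

# html.parser ships with Python; it is used here only to avoid the lxml dependency.
soup = BeautifulSoup(xml, "html.parser")
obj = soup.find("object")

by_attribute = obj.bndbox      # attribute access is shorthand for obj.find("bndbox")
by_find = obj.find("bndbox")   # searches all descendants in document order

same_tag = by_attribute is by_find
first_xmin = by_attribute.xmin.get_text()
nested_in_part = by_attribute.parent.name == "part"
print(same_tag, first_xmin, nested_in_part)
```

Both lookups land on the box inside <part> because it comes first in document order, even though it is nested one level deeper.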
So, to get the second <bndbox> tag, that is, the <bndbox> that is a direct child of the <object> tag, call find() with recursive=False instead of using attribute access (object.bndbox is shorthand for object.find('bndbox'), which searches all descendants). With recursive=False, the search considers only the direct children of the current tag.
from bs4 import BeautifulSoup

xml = '''
<object>
<name>person</name>
<part>
<name>head</name>
<bndbox>
<xmin>337</xmin>
<ymin>2</ymin>
<xmax>382</xmax>
<ymax>66</ymax>
</bndbox>
</part>
<bndbox>
<xmin>334</xmin>
<ymin>1</ymin>
<xmax>436</xmax>
<ymax>373</ymax>
</bndbox>
</object>'''
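# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(xml, 'xml')
# recursive=False restricts the search to direct children of <object>.
print(soup.object.find('bndbox', recursive=False))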
Output:
<bndbox>
<xmin>334</xmin>
<ymin>1</ymin>
<xmax>436</xmax>
<ymax>373</ymax>
</bndbox>
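As a follow-up sketch (not part of the original answer), the same recursive=False lookup can be used to pull the coordinates out as integers. The helper name box_to_dict is hypothetical, and html.parser is again an assumption chosen so the example runs without lxml.

```python
from bs4 import BeautifulSoup

xml = """<object>
<name>person</name>
<part>
<name>head</name>
<bndbox><xmin>337</xmin><ymin>2</ymin><xmax>382</xmax><ymax>66</ymax></bndbox>
</part>
<bndbox><xmin>334</xmin><ymin>1</ymin><xmax>436</xmax><ymax>373</ymax></bndbox>
</object>"""

soup = BeautifulSoup(xml, "html.parser")
obj = soup.find("object")


def box_to_dict(bndbox):
    """Hypothetical helper: map each direct child tag of a <bndbox> to an int."""
    # find_all(True, ...) matches every tag; recursive=False keeps direct children only.
    return {
        child.name: int(child.get_text())
        for child in bndbox.find_all(True, recursive=False)
    }


# Box of the whole object: direct child of <object>.
object_box = box_to_dict(obj.find("bndbox", recursive=False))
# Box of the part: direct child of <part>.
part_box = box_to_dict(obj.part.find("bndbox", recursive=False))

print(object_box)
print(part_box)
```

Looking up each box relative to its own parent keeps the object's box and the part's box apart, no matter in which order they appear in the file.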