XML reader seems to ignore tag hierarchy

Problem description

In an XML file, I'm trying to get the content of a tag that appears multiple times at different levels in the tag hierarchy. I'm trying to get the content of the highest level occurrence of the tag, but my XML reader (BeautifulSoup for Python) keeps giving me the wrong content.

Here is the concrete problem. This is part of the XML file (condensed to the parts I believe are relevant):

<object>
    <name>person</name>
    <part>
        <name>head</name>
        <bndbox>
            <xmin>337</xmin>
            <ymin>2</ymin>
            <xmax>382</xmax>
            <ymax>66</ymax>
        </bndbox>
    </part>
    <bndbox>
        <xmin>334</xmin>
        <ymin>1</ymin>
        <xmax>436</xmax>
        <ymax>373</ymax>
    </bndbox>
</object>

I'm interested in getting the content of the <bndbox> tag at the very end of this snippet via the command

box = object.bndbox

But if I print out box, I keep getting this:

<bndbox>
    <xmin>337</xmin>
    <ymin>2</ymin>
    <xmax>382</xmax>
    <ymax>66</ymax>
</bndbox>

This makes no sense to me. The box above that I keep getting is one hierarchy level lower than what I'm asking for, under a <part> tag, so I should only be able to access this box via

object.part.bndbox

while

object.bndbox

should give me the only box that is hierarchically directly under the object tag, which is the last box in the snippet above.

Recommended answer

As stated by @mjsqu in the comments (https://stackoverflow.com/questions/49971317/xml-reader-seems-to-ignore-tag-hierarchy#comment86958670_49971317):

BeautifulSoup returns the first tag matching that name, so object.bndbox refers to the first bndbox in the XML, regardless of its position in the hierarchy.
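As a rough illustration of the comment above, the sketch below checks that attribute-style access behaves like a plain find() call and that the tag it returns sits under <part>. The trimmed-down XML and the variable names are made up for this example, and it uses the built-in html.parser instead of the 'xml' parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the XML from the question (illustrative only).
xml = """<object>
    <part>
        <bndbox><xmin>337</xmin></bndbox>
    </part>
    <bndbox><xmin>334</xmin></bndbox>
</object>"""

# Assumption: html.parser is enough for this small demo; the answer uses 'xml'.
soup = BeautifulSoup(xml, "html.parser")
obj = soup.find("object")

by_attribute = obj.bndbox      # attribute-style access
by_find = obj.find("bndbox")   # explicit search through all descendants

same_tag = by_attribute is by_find        # both name the same tag in the tree
parent_name = by_attribute.parent.name    # where that tag actually lives

print(same_tag, parent_name)
```

Both lookups walk every descendant in document order, which is why the nested box wins.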

So, to get the second <bndbox> tag, that is, the <bndbox> that is a direct child of the <object> tag, pass recursive=False to find(). The search then considers only direct children of the current tag.

from bs4 import BeautifulSoup

xml = '''
<object>
    <name>person</name>
    <part>
        <name>head</name>
        <bndbox>
            <xmin>337</xmin>
            <ymin>2</ymin>
            <xmax>382</xmax>
            <ymax>66</ymax>
        </bndbox>
    </part>
    <bndbox>
        <xmin>334</xmin>
        <ymin>1</ymin>
        <xmax>436</xmax>
        <ymax>373</ymax>
    </bndbox>
</object>'''

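# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(xml, 'xml')
# recursive=False restricts the search to direct children of <object>.
print(soup.object.find('bndbox', recursive=False))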

Output:

<bndbox>
<xmin>334</xmin>
<ymin>1</ymin>
<xmax>436</xmax>
<ymax>373</ymax>
</bndbox>
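For comparison, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than BeautifulSoup. This is a different library from the one in the answer; its find() with a bare tag name looks only at direct children, which matches the behaviour the question expected. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml = """<object>
    <name>person</name>
    <part>
        <name>head</name>
        <bndbox>
            <xmin>337</xmin>
            <ymin>2</ymin>
            <xmax>382</xmax>
            <ymax>66</ymax>
        </bndbox>
    </part>
    <bndbox>
        <xmin>334</xmin>
        <ymin>1</ymin>
        <xmax>436</xmax>
        <ymax>373</ymax>
    </bndbox>
</object>"""

obj = ET.fromstring(xml)  # the root element is <object>

# A bare tag name matches direct children only, so this is the outer box.
box = obj.find("bndbox")
coords = {child.tag: int(child.text) for child in box}

# The nested box has to be addressed through its parent explicitly.
nested = obj.find("part/bndbox")
nested_coords = {child.tag: int(child.text) for child in nested}

print(coords)
print(nested_coords)
```

If the rest of the code already relies on BeautifulSoup, find('bndbox', recursive=False) as shown in the answer is likely the smaller change.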
