Get all same attribute values for XML in Python


Problem description


I am trying to get all "points" attribute values from the "TextRegion --> Coords" tags, but I keep getting errors. Note: there are tags called "TextRegion" and "ImageRegion", both of which contain "Coords". However, I only want the Coords points from "TextRegion".

Please help! Thank you!!

Here is my xml file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
    <Metadata>
        <Creator/>
        <Created>2021-01-24T17:11:35</Created>
        <LastChange>1969-12-31T19:00:00</LastChange>
        <Comments/>
    </Metadata>
    <Page imageFilename="0004.png" imageHeight="3655" imageWidth="2493">
        <TextRegion id="r1" type="paragraph">
            <Coords points="1653,146 1651,148"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
        <TextRegion id="r2" type="paragraph">
            <Coords points="2071,326 2069,328 2058,328 2055"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
        <ImageRegion id="r3">
            <Coords points="443,621 443,2802 2302,2802 2302,621"/>
        </ImageRegion>
        <TextRegion id="r4" type="paragraph">
            <Coords points="2247,2825 2247,2857 2266,2857 2268,2860 2268"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
        <TextRegion id="r5" type="paragraph">
            <Coords points="731,2828 731,2839 728,2841"/>
            <TextEquiv>
                <Unicode/>
            </TextEquiv>
        </TextRegion>
    </Page>
</PcGts>

Here is my code:

from lxml import etree as ET

tree = ET.parse('0004.xml')
root = tree.getroot()
print(root.tag)

for tag in root.find_all('Page/TextRegion/Coords'):
    value = tag.get('points')
    print(value)

Solution

Assuming the missing closing > on the root element's opening tag in your posted XML is just a copy/paste error, your main issue is a classic one: parsing nodes under a default namespace. A default namespace is declared by an xmlns attribute without a colon-separated prefix, as opposed to a prefixed declaration like xmlns:doc="...". Every element under it belongs to that namespace, so a path made of bare tag names matches nothing.

As a result, you need to define a temporary namespace prefix in Python and apply it to each named element in the path. You can do this with a dictionary passed into findall (not find_all, which does not exist in lxml).

from lxml import etree as ET

tree = ET.parse('0004.xml')
nsmp = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

root = tree.getroot()
print(root.tag)

# SPECIFY NAMESPACE AND PREFIX ALL NAMED ELEMENTS
for tag in root.findall('doc:Page/doc:TextRegion/doc:Coords', namespaces=nsmp):
    value = tag.get('points')
    print(value)

# 1653,146 1651,148
# 2071,326 2069,328 2058,328 2055
# 2247,2825 2247,2857 2266,2857 2268,2860 2268
# 731,2828 731,2839 728,2841
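
To see why the unprefixed path fails, it may help to look at how the parser stores tag names. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree so it runs without installing anything, and the inline XML string, the XML and NS names, and the shortened coordinate values are illustrative assumptions standing in for 0004.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed stand-in for 0004.xml (illustrative data only).
XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1"><Coords points="1653,146 1651,148"/></TextRegion>
    <ImageRegion id="r3"><Coords points="443,621 443,2802"/></ImageRegion>
    <TextRegion id="r5"><Coords points="731,2828 731,2839 728,2841"/></TextRegion>
  </Page>
</PcGts>"""

# The prefix 'doc' is arbitrary; only the URI has to match the document.
NS = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

root = ET.fromstring(XML)

# The default namespace becomes part of every tag name ("{uri}localname").
print(root.tag)

# An unprefixed path looks for tags in no namespace, so nothing matches.
unqualified = root.findall('Page/TextRegion/Coords')

# The prefixed path is expanded through the mapping and matches,
# and the ImageRegion coordinates are skipped.
qualified = root.findall('doc:Page/doc:TextRegion/doc:Coords', NS)
text_region_points = [c.get('points') for c in qualified]

print(len(unqualified), text_region_points)
```

The printed root.tag shows the namespace URI in curly braces in front of PcGts, which is the name the path expression actually has to match.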

By the way, lxml is a feature-rich XML library (it requires a third-party installation) that, among other powerful tools, supports full XPath 1.0. The above code also works with Python's built-in etree simply by changing the import line to from xml.etree import ElementTree as ET.

However, lxml goes beyond the built-in library; for example, its xpath method can select attribute values directly:

tree = ET.parse('0004.xml')

# SPECIFY NAMESPACE AND PREFIX ALL NAMED ELEMENTS
# (restrict to TextRegion; a bare //doc:Coords would also match ImageRegion)
for pts in tree.xpath('//doc:TextRegion/doc:Coords/@points', namespaces=nsmp):
    print(pts)

# 1653,146 1651,148
# 2071,326 2069,328 2058,328 2055
# 2247,2825 2247,2857 2266,2857 2268,2860 2268
# 731,2828 731,2839 728,2841
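
The built-in ElementTree has no xpath method and its path syntax cannot select attributes, so a rough equivalent there is to find the elements and read the attribute in a comprehension. This is a sketch under assumptions: the helper name text_region_points, the NS mapping name, and the inline SAMPLE document are hypothetical and only stand in for the real file.

```python
import xml.etree.ElementTree as ET

# Same URI as in the document; the prefix 'doc' is an arbitrary local choice.
NS = {'doc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}


def text_region_points(xml_text):
    """Return the points attribute of each Coords directly under a TextRegion."""
    root = ET.fromstring(xml_text)
    # './/' searches all descendants; the TextRegion step filters out
    # Coords elements that belong to ImageRegion.
    return [coords.get('points')
            for coords in root.iterfind('.//doc:TextRegion/doc:Coords', NS)]


# Shortened, illustrative stand-in for 0004.xml.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1"><Coords points="1653,146 1651,148"/></TextRegion>
    <ImageRegion id="r3"><Coords points="443,621 443,2802"/></ImageRegion>
    <TextRegion id="r5"><Coords points="731,2828 731,2839 728,2841"/></TextRegion>
  </Page>
</PcGts>"""

for pts in text_region_points(SAMPLE):
    print(pts)
```

For a file on disk, ET.parse('0004.xml').getroot() could replace the ET.fromstring call.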
