Extracting a nested namespace from XML using lxml

Question

I'm new to Python and currently learning to parse XML. All seemed to be going well until I hit a wall with nested namespaces.

Below is a snippet of my XML (with the root element and the child element that I'm trying to parse):

<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
------------- 
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>

What I need is a way to extract the namespace URI of the element whose local name is 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#". I looked through a lot of tutorials but cannot seem to find a solution.

If anyone out there can lend their expertise, it would be much appreciated.

Here is what I have done so far, with help from the two contributors:

#!/usr/bin/env python

import os
import sys

from xml.etree import ElementTree as ET  # import ElementTree module as an alias ET
from lxml import objectify, etree


def parse():
    # Resolve the CPL file named on the command line relative to this script
    cpl_file = sys.argv[1]
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file, cpl_file)

    # Read as bytes so lxml can honour the encoding declaration
    with open(xml_file, 'rb') as f:
        xml = f.read()

    tree = etree.XML(xml)

    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace

    print(caption_namespace)
    print(tree.nsmap)

    nsmap = {}

    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)

    return nsmap


if __name__ == "__main__":
    parse()

But it's not working so far. I got the result 'None' when I used QName to locate the tag and its namespace. And when I tried to collect all the namespaces in the XML with a for loop, as suggested in another post, I got the error 'Unknown return type: dict'.

Any suggestions?

Recommended answer

This program prints the namespace of the indicated tag:

from lxml import etree

# Bytes literal: lxml rejects a unicode string that carries an encoding declaration
xml = etree.XML(b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec-997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')

# '{*}' matches the local name in any namespace; QName splits the tag into URI and local name
print(etree.QName(xml.find('.//{*}MainClosedCaption')).namespace)

Result:

http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
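
The same lookup can be wrapped so that a missing element is handled explicitly, since find() returns None when nothing matches. This is only a sketch: the helper name namespace_of and the trimmed-down document are assumptions made for illustration and are not part of the original answer.

```python
from lxml import etree

# Trimmed-down version of the document from the question (illustrative only)
XML = b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<EditRate>24 1</EditRate>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
'''


def namespace_of(root, local_name):
    """Hypothetical helper: namespace URI of the first descendant with this local name, or None."""
    element = root.find('.//{*}' + local_name)  # '{*}' matches any namespace
    if element is None:
        return None
    return etree.QName(element).namespace


root = etree.XML(XML)
caption_ns = namespace_of(root, 'MainClosedCaption')
missing_ns = namespace_of(root, 'NoSuchElement')
print(caption_ns)
print(missing_ns)
```

Checking for None before calling QName keeps a typo in the local name from surfacing as a confusing error further down.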

Reference: http://lxml.de/tutorial.html#namespaces
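
The 'Unknown return type: dict' error mentioned in the question most likely comes from the keyword name. xpath() expects the prefix map under namespaces=, and other keyword arguments are treated as XPath variables, so namespace=nsmap hands lxml a dict it cannot convert. A minimal sketch of the loop with the keyword corrected, assuming a trimmed-down copy of the document (skipping the built-in xml prefix is a defensive choice here, not something the original post does):

```python
from lxml import etree

# Trimmed-down version of the document from the question (illustrative only)
XML = b'''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<EditRate>24 1</EditRate>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
'''

root = etree.XML(XML)

# Collect prefix -> URI pairs for the namespaces visible anywhere in the tree.
# lxml returns each namespace node as a (prefix, uri) tuple.
nsmap = {}
for prefix, uri in root.xpath('//namespace::*'):
    # Skip the default namespace (prefix None), which XPath cannot address
    # without a prefix, and the built-in 'xml' prefix.
    if prefix and prefix != 'xml':
        nsmap[prefix] = uri

# Note the keyword: namespaces=, not namespace=
matches = root.xpath('//cc-cpl:MainClosedCaption', namespaces=nsmap)
print(nsmap)
print(matches[0].tag)
```

Elements in the default namespace, such as CompositionPlaylist, still need a prefix of your own choosing in the map before an XPath expression can select them.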
