How to traverse XML document tags using lxml similarly to ElementTree

Question

Currently I'm editing an XML document in which I have to change a few tags and their attributes. Until now I have been using the ElementTree library, but I ran into problems with namespace preservation, so I'm trying to rewrite my script to use lxml. However, ElementTree's way of traversing the document tags felt very logical to me. As an example, the code below removes the Ext tag from the XML and changes the text of the Resolution tag to a different value.

ElementTree:

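import xml.etree.ElementTree as ET

# adiPath is the path to the XML file (defined elsewhere in the script).
# Re-register every prefix declared in the file so it is reused on output.
namespaces = dict([elem for _, elem in ET.iterparse(adiPath, events=['start-ns'])])
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)

# Assumed: the original snippet does not show how root was obtained.
root = ET.parse(adiPath).getroot()
# The {*} namespace wildcard requires Python 3.8+.
for asset in root.findall('.//{*}Asset'):
    if 'title:TitleType' in asset.attrib.values():
        ext = asset.find('.//{*}Ext')
        if ext is not None:
            asset.remove(ext)  # works because Ext is a direct child of Asset
    if 'content:PreviewType' in asset.attrib.values():
        resolution = asset.find('.//{*}Resolution')
        if resolution is not None:
            resolution.text = 'different value'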

Is it possible to iterate through the XML file in a similar way to the code above, but using lxml instead of ET?

XML file:

<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:content="urn:cablelabs:md:xsd:content:3.0"
      xmlns:core="urn:cablelabs:md:xsd:core:3.0"
      xmlns:offer="urn:cablelabs:md:xsd:offer:3.0"
      xmlns:terms="urn:cablelabs:md:xsd:terms:3.0"
      xmlns:title="urn:cablelabs:md:xsd:title:3.0"
      xmlns:adb="urn:adb:md:xsd:adb:01"
      xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd urn:cablelabs:md:xsd:core:3.0 MD-SP-CORE-C01.xsd urn:cablelabs:md:xsd:content:3.0 MD-SP-CONTENT-C01.xsd urn:cablelabs:md:xsd:offer:3.0 MD-SP-OFFER-C01.xsd urn:cablelabs:md:xsd:terms:3.0 MD-SP-TERMS-C01.xsd urn:cablelabs:md:xsd:title:3.0 MD-SP-TITLE-C01.xsd"
      xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <Asset xsi:type="title:TitleType" uriId="ID" providerVersionNum="5"
     internalVersionNum="0" creationDateTime="2020-04-22T00:00:00Z"
     startDateTime="2020-03-24T09:00:00Z" endDateTime="2022-10-06T23:59:00Z">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <ProviderQAContact>Contact</ProviderQAContact>
    <Ext>
      <adb:ExtensionType>
        <adb:TitleExt>
          <adb:SeriesInfo episodeNumber="16">
            <adb:series seriesId="106585" seasonCount="2"/>
            <adb:season seasonId="106586" number="1" episodeCount="22"/>
          </adb:SeriesInfo>
        </adb:TitleExt>
      </adb:ExtensionType>
    </Ext>
    <title:LocalizableTitle xml:lang="pol">
      <title:TitleLong>BATWOMAN EP. 16 - THROUGH THE LOOKING GLASS</title:TitleLong>
      <title:SummaryLong> Very long summary...</title:SummaryLong>
      <title:Actor fullName="Ruby Rose" firstName="Ruby" lastName="Rose"/>
      <title:Actor fullName="Rachel Skarsten" firstName="Rachel" lastName="Skarsten"/>
      <title:Actor fullName="Meagan Tandy" firstName="Meagan" lastName="Tandy"/>
      <title:Actor fullName="Camrus Johnson" firstName="Camrus" lastName="Johnson"/>
      <title:Director fullName="Sudz Sutherland" firstName="Sudz" lastName="Sutherland"/>
    </title:LocalizableTitle>
    <title:Rating ratingSystem="PL">12</title:Rating>
    <title:DisplayRunTime>00:40</title:DisplayRunTime>
    <title:Year>2019</title:Year>
    <title:CountryOfOrigin>US</title:CountryOfOrigin>
    <title:Genre>Genre</title:Genre>
    <title:ShowType>Movie</title:ShowType>
  </Asset>
  <Asset xsi:type="offer:CategoryType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:CategoryPath>Path</offer:CategoryPath>
  </Asset>
  <Asset xsi:type="content:MovieType" uriId="namemp4">
    <AlternateId identifierSystem="VOD1.1">namemp4</AlternateId>
    <content:SourceUrl>name.mp4</content:SourceUrl>
    <content:Resolution>resolution</content:Resolution>
    <content:Duration>PT0H40M40S</content:Duration>
    <content:Language>pol</content:Language>
    <content:SubtitleLanguage>pol</content:SubtitleLanguage>
    <content:SubtitleLanguage>eng</content:SubtitleLanguage>
  </Asset>
  <Asset uriId="ID" xsi:type="content:MovieType">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Provider>Prov</Provider>
    <content:SourceUrl>sub.srt</content:SourceUrl>
  </Asset>
  <Asset uriId="ID" xsi:type="content:MovieType">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Provider>Prov</Provider>
    <content:SourceUrl>sub.srt</content:SourceUrl>
  </Asset>
  <Asset xsi:type="content:PosterType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <content:SourceUrl>poster.jpg</content:SourceUrl>
    <content:X_Resolution>700</content:X_Resolution>
    <content:Y_Resolution>1000</content:Y_Resolution>
    <content:Language>pol</content:Language>
  </Asset>
  <Asset xsi:type="offer:ContentGroupType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:TitleRef uriId="ID"/>
    <offer:MovieRef uriId="namets"/>
    <offer:MovieRef uriId="subs"/>
    <offer:MovieRef uriId="subs"/>
  </Asset>
  <Asset xsi:type="offer:ContentGroupType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:TitleRef uriId="ID"/>
    <offer:MovieRef uriId="poster"/> 
  </Asset>
</ADI3>

Recommended answer

Observations about your input document:

  • The document defines the default namespace (xmlns="...") as urn:cablelabs:md:xsd:core:3.0.
  • It defines the same namespace a second time as "core" (xmlns:core="urn:cablelabs:md:xsd:core:3.0").
  • xmlns:schemaLocation is wrong and should be xsi:schemaLocation.
  • The namespace called "terms" (urn:cablelabs:md:xsd:terms:3.0) is not used at all.
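
These observations can be checked with the standard library. The following is a minimal sketch, assuming a shortened stand-in for the root element; the `sample` string and the variable names are illustrative and not part of the original script.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the root element of the input document.
sample = (
    '<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"'
    ' xmlns:core="urn:cablelabs:md:xsd:core:3.0"'
    ' xmlns:terms="urn:cablelabs:md:xsd:terms:3.0"'
    ' xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd">'
    '<Asset uriId="ID"/>'
    '</ADI3>'
)

# Collect every (prefix, uri) pair the parser reports as a namespace declaration.
source = io.BytesIO(sample.encode("utf-8"))
declarations = dict(ns for _, ns in ET.iterparse(source, events=["start-ns"]))

# The default prefix ("") and "core" are bound to the same URI.
print(declarations[""] == declarations["core"])  # True

# "schemaLocation" was parsed as a prefix, not as an attribute.
print("schemaLocation" in declarations)  # True

root = ET.fromstring(sample)
print(root.attrib)  # {}
print(root.tag)     # {urn:cablelabs:md:xsd:core:3.0}ADI3
```

Because xmlns:schemaLocation is consumed as a declaration, the root element ends up with no attributes at all.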

When you read this document and write it back out, as your code sample does, all of the information is retained.

But there is no guarantee that the output document is a character-by-character copy of the input document. That's not how XML works, and it's an unreasonable expectation. The guarantee that matters is that the output document is semantically equivalent to the input document.
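
One way to picture "semantically equivalent" is to compare two parsed trees by expanded element name, attributes, text, and children instead of comparing characters. The helper below is a rough sketch under that assumption. The name `equivalent` is hypothetical, and treating surrounding whitespace as insignificant is a simplification that would not suit every document.

```python
import xml.etree.ElementTree as ET


def equivalent(a, b):
    """Compare two elements by expanded name, attributes, text, and children."""
    def text(value):
        # Simplification: ignore leading/trailing whitespace (indentation).
        return (value or "").strip()

    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and text(a.text) == text(b.text)
        and text(a.tail) == text(b.tail)
        and len(a) == len(b)
        and all(equivalent(x, y) for x, y in zip(a, b))
    )


# Same content, once with a default namespace and once with a "core" prefix.
with_default = ET.fromstring(
    '<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"><Asset uriId="ID">x</Asset></ADI3>'
)
with_prefix = ET.fromstring(
    '<core:ADI3 xmlns:core="urn:cablelabs:md:xsd:core:3.0">\n'
    '  <core:Asset uriId="ID">x</core:Asset>\n'
    '</core:ADI3>'
)

print(equivalent(with_default, with_prefix))  # True
```

The two strings differ character by character, yet the parsed trees carry the same element names in the same namespace.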

When your code runs, it produces this output (abridged):

<core:ADI3
  xmlns:adb="urn:adb:md:xsd:adb:01"
  xmlns:content="urn:cablelabs:md:xsd:content:3.0"
  xmlns:core="urn:cablelabs:md:xsd:core:3.0" 
  xmlns:offer="urn:cablelabs:md:xsd:offer:3.0"
  xmlns:title="urn:cablelabs:md:xsd:title:3.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>
  <core:Asset xsi:type="title:TitleType" uriId="ID" providerVersionNum="5" internalVersionNum="0" creationDateTime="2020-04-22T00:00:00Z" startDateTime="2020-03-24T09:00:00Z" endDateTime="2022-10-06T23:59:00Z">
    <core:AlternateId identifierSystem="VOD1.1">ID</core:AlternateId>

    <!-- ... -->

  </core:Asset>
</core:ADI3>

The ADI3 element is still in the urn:cablelabs:md:xsd:core:3.0 namespace, as before. Whether this is achieved via a default namespace or via an explicit prefix is irrelevant. ElementTree knew a prefix for this namespace ("core") and decided to use it. There is nothing wrong with that; it is still the same thing.
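
The prefix choice can be reproduced in isolation. This sketch assumes that the default prefix and "core" are registered for the same URI in that order, which is what the question's register_namespace loop would do if the declarations are reported in document order. The one-element `doc` string is a stand-in. Note that ET.register_namespace changes process-wide state.

```python
import xml.etree.ElementTree as ET

CORE = "urn:cablelabs:md:xsd:core:3.0"

# Stand-in document that uses only the default namespace.
doc = '<ADI3 xmlns="%s"><Asset uriId="ID"/></ADI3>' % CORE

# Registering a second prefix for the same URI replaces the first mapping.
ET.register_namespace("", CORE)
ET.register_namespace("core", CORE)

root = ET.fromstring(doc)
out = ET.tostring(root, encoding="unicode")
print(out.startswith("<core:ADI3"))  # True

# Parsing the output again yields the same expanded element names.
reparsed = ET.fromstring(out)
print(reparsed.tag == root.tag)  # True
```

If only the default prefix were registered for that URI, ElementTree would write the elements without a prefix instead.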

The namespace urn:cablelabs:md:xsd:terms:3.0 ("terms") is missing from the output because it was unused in the input and keeping unused declarations is pointless.

The same thing applies to "schemaLocation": because you wrote it as a namespace declaration (xmlns:schemaLocation), ElementTree saw that this "namespace" was unused and stripped it. The correct form would have been a namespaced attribute (xsi:schemaLocation). Once you correct that error, the item will stay in the output.
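
The difference between the two spellings shows up when serializing. The sketch below uses two hypothetical one-element documents (`wrong` and `right`) rather than the full file, and registers the "xsi" prefix explicitly so the output does not depend on ElementTree's built-in prefix table.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
LOCATION = "urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd"
ET.register_namespace("xsi", XSI)

# Written as namespace declarations: neither prefix is used by any element
# or attribute, so nothing of them remains in the parsed tree.
wrong = ET.fromstring(
    '<ADI3 xmlns:terms="urn:cablelabs:md:xsd:terms:3.0"'
    ' xmlns:schemaLocation="%s"/>' % LOCATION
)

# Written as a namespaced attribute: it is part of the tree and is kept.
right = ET.fromstring(
    '<ADI3 xmlns:xsi="%s" xsi:schemaLocation="%s"/>' % (XSI, LOCATION)
)

wrong_out = ET.tostring(wrong, encoding="unicode")
right_out = ET.tostring(right, encoding="unicode")

print(wrong_out)  # <ADI3 />
print("xsi:schemaLocation=" in right_out)  # True
```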

To sum it all up: You don't have a problem. The output document is the same.
