Python XML Remove Some Elements and Their Children but Keep Specific Elements and Their Children

Problem description

I have a very large .xml file and I am trying to make a new .xml file that contains just a small part of its contents. I want to specify an attribute (in my case, itemID) with a few specific values, and then strip away every element except the ones that have those itemIDs, together with their children.

My large .xml file looks something like this:

<?xml version='1.0' encoding='UTF-8'?>
<api version="2">
  <currentTime>2013-02-27 17:00:18</currentTime>
  <result>
    <rowset name="assets" key="itemID" columns="itemID,locationID,typeID,quantity,flag,singleton">
      <row itemID="1008551770576" locationID="31000559" typeID="17187" quantity="1" flag="0" singleton="1" rawQuantity="-1" />
      <row itemID="1008700753886" locationID="31000559" typeID="17187" quantity="1" flag="0" singleton="1" rawQuantity="-1" />
      <row itemID="1008700756994" locationID="31000559" typeID="17184" quantity="1" flag="0" singleton="1" rawQuantity="-1" />
      <row itemID="1008701224901" locationID="31000559" typeID="17186" quantity="1" flag="0" singleton="1" rawQuantity="-1" />
      <row itemID="1004072840841" locationID="31002238" typeID="17621" quantity="1" flag="0" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="150571923" typeID="25863" quantity="2" flag="119" singleton="0" />
          <row itemID="188435728" typeID="3388" quantity="1" flag="119" singleton="0" />
          <row itemID="210122947" typeID="3419" quantity="4" flag="119" singleton="0" />
        </rowset>
      </row>
      <row itemID="1005279202146" locationID="31002238" typeID="17621" quantity="1" flag="0" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="1004239962001" typeID="16275" quantity="49596" flag="4" singleton="0" />
          <row itemID="1005364142068" typeID="4246" quantity="156929" flag="4" singleton="0" />
          <row itemID="1005624252854" typeID="4247" quantity="93313" flag="4" singleton="0" />
        </rowset>
      </row>
      <row itemID="1004388226389" typeID="648" quantity="1" flag="0" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="1004388228218" typeID="31119" quantity="1" flag="92" singleton="1" rawQuantity="-1" />
          <row itemID="1004388701243" typeID="31119" quantity="1" flag="94" singleton="1" rawQuantity="-1" />
          <row itemID="1004388701485" typeID="31119" quantity="1" flag="93" singleton="1" rawQuantity="-1" />
          <row itemID="1009147502645" typeID="51" quantity="1" flag="5" singleton="1" rawQuantity="-1" />
        </rowset>
      </row>
    </rowset>
  </result>
  <cachedUntil>2013-02-27 23:00:18</cachedUntil>
</api>

This file has around ninety thousand rows and is about 9 megabytes.

Note that there are itemIDs, and that some item types can (but don't always) contain more items, and these children also have their own itemIDs. I am trying to keep a few specific itemIDs and their children and leave out everything else.

I used the code from this answer and it gets me quite close. It is perfect except that it leaves out the children of the itemID I used.

My code is as follows:

import lxml.etree as le

##Take this big .xml file and pull out just the parts we want then write those to a new .xml file##
with open(filename,'r') as f:
    doc=le.parse(f)
    for elem in doc.xpath('//*[attribute::itemID]'):
        if elem.attrib['itemID']=='1004072840841':
            elem.attrib.pop('itemID')
        else:
            parent=elem.getparent()
            parent.remove(elem)
    print(le.tostring(doc))

This is what the resulting print out looks like:

<api version="2">
  <currentTime>2013-03-01 21:46:52</currentTime>
  <result>
    <rowset name="assets" key="itemID" columns="itemID,locationID,typeID,quantity,flag,singleton">
      <row locationID="31002238" typeID="17621" quantity="1" flag="0" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          </rowset>
      </row>
      </rowset>
  </result>
  <cachedUntil>2013-03-02 03:46:53</cachedUntil>
</api>

I want it to look like this:

<api version="2">
  <currentTime>2013-03-01 21:46:52</currentTime>
  <result>
    <rowset name="assets" key="itemID" columns="itemID,locationID,typeID,quantity,flag,singleton">
      <row locationID="31002238" typeID="17621" quantity="1" flag="0" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="150571923" typeID="25863" quantity="2" flag="119" singleton="0" />
          <row itemID="188435728" typeID="3388" quantity="1" flag="119" singleton="0" />
          <row itemID="210122947" typeID="3419" quantity="4" flag="119" singleton="0" />
        </rowset>
      </row>
      </rowset>
  </result>
  <cachedUntil>2013-03-02 03:46:53</cachedUntil>
</api>

I don't understand the code well enough to see what I'd need to change in order to also include the children of the itemID I search for. Ideally, I'd also be able to pass in multiple itemIDs and have it strip away everything but those itemIDs and their children. This means it would need to keep the itemID=[number] row attribute (so that I could use XPath to refer to a particular itemID and its children when I use this .xml file).

So my main question is about how to include the children of the itemID I search for in my resulting .xml. My secondary question is about how to do this for more than one itemID at the same time (so that the resulting .xml file would strip away all but those itemIDs and their children.)

I figured out that the elem.attrib.pop('itemID') part was what removed the itemID attribute, and since I'd like multiple itemIDs and their children to remain, I needed to keep that attribute, so I took that call out. I was trying to find a way to skip over the children of the row with the itemID I was searching for, and what I came up with was to flag each one with an attribute that I could then search for, deleting every element that doesn't have that flag. I don't need the flag attribute for what I'm doing, so I went ahead and used it for this purpose (my attempts to introduce a new attribute ran into a KeyError when I tried to iterate back over the elements). Flagging just the children wasn't enough; I also had to flag the children's children.

Here is my ugly solution:

with open(filename,'r') as f:
    doc=le.parse(f)
    for elem in doc.xpath('//*[attribute::itemID]'):
        if elem.attrib['itemID']=='1004072840841' or elem.attrib['itemID']=='1005279202146': # this or statement lets me get a resulting .xml file that has two itemIDs and their children
            elem.attrib['flag']='Keep'
            for child in elem.iterchildren():
                child.attrib['flag']='Keep'
                for c in child.iterchildren():
                    c.attrib['flag']='Keep'
        else:
            pass
    for e in doc.xpath('//*[attribute::flag]'):
        if e.attrib['flag']!='Keep':
            parent=e.getparent()
            parent.remove(e)
        else:
            pass
    print(le.tostring(doc))
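    ##This part writes the pruned down .xml to a file##
    # le.tostring() returns bytes, so open the file in binary mode and write it in one call;
    # the with block closes the file on exit
    with open('test.xml', 'wb') as t:
        t.write(le.tostring(doc))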

This ugly solution involves a lot of iterating over the data and is, I suspect, far from the most efficient way of getting this done, but it does work.

Recommended answer

It's not very clear exactly what you're after, but this code produces the output you say you'd like:

from lxml import etree as ET

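def filter_by_itemid(doc, idlist):
    # Only the top-level rows are filtered; a kept row keeps its whole subtree.
    rowset = doc.xpath("/api/result/rowset[@name='assets']")[0]
    # Iterate over a copy of the children so removing rows is safe.
    for elem in list(rowset):
        if int(elem.get("itemID")) not in idlist:
            rowset.remove(elem)
    return doc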

doc = ET.parse("test.xml")
filter_by_itemid(doc, [1004072840841])

print(ET.tostring(doc))
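
If the rows you want to keep are not guaranteed to sit directly under the "assets" rowset, the same idea can be written as a recursive walk. The following is only a minimal sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (so it has no XPath getparent() and recurses from the parent instead), the prune function and the shortened SAMPLE document are made up for illustration, and itemIDs are compared as strings. A kept row is left untouched, so all of its children survive; a wanted itemID nested inside an unwanted row would still be removed together with that row.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the large file from the question.
SAMPLE = """<api version="2">
  <result>
    <rowset name="assets" key="itemID">
      <row itemID="1008551770576" typeID="17187" />
      <row itemID="1004072840841" typeID="17621">
        <rowset name="contents" key="itemID">
          <row itemID="150571923" typeID="25863" />
          <row itemID="188435728" typeID="3388" />
        </rowset>
      </row>
      <row itemID="1005279202146" typeID="17621">
        <rowset name="contents" key="itemID">
          <row itemID="1004239962001" typeID="16275" />
        </rowset>
      </row>
    </rowset>
  </result>
</api>"""


def prune(parent, keep_ids):
    """Remove elements whose itemID is not in keep_ids; kept elements retain their subtree."""
    # Iterate over a copy because we remove children while looping.
    for child in list(parent):
        item_id = child.get("itemID")
        if item_id is None:
            # Structural element without an itemID (e.g. <result>, <rowset>): look inside it.
            prune(child, keep_ids)
        elif item_id not in keep_ids:
            # Unwanted row: drop it together with everything below it.
            parent.remove(child)
        # Otherwise this is a wanted row: leave it and its children alone.


root = ET.fromstring(SAMPLE)
prune(root, {"1004072840841"})

# Collect the itemIDs that survived, in document order.
kept = [row.get("itemID") for row in root.iter("row")]
print(kept)
```

Passing several values, for example {"1004072840841", "1005279202146"}, keeps each of those rows with its children, and the itemID attributes stay in place so they can still be addressed later. To save the result, ET.ElementTree(root).write("pruned.xml", encoding="UTF-8", xml_declaration=True) should work, where "pruned.xml" is just a placeholder name.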
