Search/replace content of XML

Question

I've been successful using xml.etree.ElementTree to parse an XML file, search for content, then write it to a different XML file. However, I only worked with text inside a single tag.

import os, glob, xml.etree.ElementTree as ET
path = r"G:\63D RRC GIS Data\metadata\general\2010_contract"
for fn in os.listdir(path):
    filepaths = glob.glob(os.path.join(path, fn, "*overall.xml"))
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)
        #### use this section to grab element text from old, archived metadata files; this text then gets put into current, working .xml ###
        root = ET.parse(os.path.join(pa, "archive", "base_metadata_overall.xml")).getroot()

        correct_abstract = None
        for item in root.iter("abstract"):
            correct_abstract = item.text

        tree = ET.parse(os.path.join(pa, "base_metadata_overall.xml"))
        root2 = tree.getroot()

        # each <descript> holds the <abstract> child that gets replaced
        for item in root2.iter("descript"):
            old_abstract = item.find("abstract")
            if old_abstract is not None and correct_abstract is not None:
                item.remove(old_abstract)
                new_symbol_abstract = ET.SubElement(item, "title")
                new_symbol_abstract.text = correct_abstract
        tree.write(os.path.join(pa, "base_metadata_overall.xml"))
        print("created --- " + filename + " metadata")

But now, I need to:

1) Search an XML file and grab everything between the "attr" tags. An example is below:

<attr><attrlabl Sync="TRUE">OBJECTID</attrlabl><attalias Sync="TRUE">ObjectIdentifier</attalias><attrtype Sync="TRUE">OID</attrtype><attwidth Sync="TRUE">4</attwidth><atprecis Sync="TRUE">0</atprecis><attscale Sync="TRUE">0</attscale><attrdef Sync="TRUE">Internal feature number.</attrdef></attr>

2) Then open a different XML file, search for all content between the same "attr" tags, and replace it with the above.

Basically, this is what I was doing before, except that I want to ignore the subelements, attributes, etc. between the "attr" tags and treat everything there like text.
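
As a minimal sketch of step 1, assuming the standard-library xml.etree.ElementTree is acceptable: each attr element can be serialized whole, subelements and attributes included, so it can be handled like one block of text. The <metadata> wrapper, the sample string, and the helper name attr_blocks are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the <attr> content mirrors the example above.
SAMPLE = (
    '<metadata><detailed>'
    '<attr><attrlabl Sync="TRUE">OBJECTID</attrlabl>'
    '<attrtype Sync="TRUE">OID</attrtype></attr>'
    '</detailed></metadata>'
)

def attr_blocks(xml_text, tag="attr"):
    """Return every <tag> element, children and attributes included, as text."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(el, encoding="unicode") for el in root.iter(tag)]

if __name__ == "__main__":
    for block in attr_blocks(SAMPLE):
        print(block)
```

Working with parsed elements rather than a regular expression avoids problems with whitespace and nested tags, which is probably why the pattern-based attempt is fragile.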

Thanks!!!

Please bear with me, posting on this forum is a little different from what I'm used to!

Here's what I have so far:

import os, sys, glob, re, xml.etree.ElementTree as ET
from lxml import etree

path = r"C:\\temp\\python\\xml"
for fn in os.listdir(path):
    filepaths = glob.glob(path + os.sep + fn + os.sep +  "*overall.xml")
    for filepath in filepaths:
            (pa, filename) = os.path.split(filepath)

            xml = open(pa + os.sep + "attributes.xml")
            xmltext = xml.read()
            correct_attrs = re.findall("<attr> (.*?)</attr>",xmltext,re.DOTALL)
            for item in correct_attrs:
                correct_attribute = "<attr>" + item + "</attr>"

                xml2 = open(pa + os.sep + "base_metadata_overall.xml")
                xmltext2 = xml2.read()
                old_attrs = re.findall("<attr>(.*?)</attr>",xmltext,re.DOTALL)
                for item2 in old_attrs:
                    old_attribute = "<attr>" + item + "</attr>"               



                    old = etree.fromstring(old_attribute)
                    replacement = new.xpath('//attr')
                    for attr in old.xpath('//attr'):
                        attr.getparent().replace(attr, copy.deepcopy(replacement))
                        print lxml.etree.tostring(old)

Got this working, see below; I even figured out how to export to a new .xml. However, if the number of attr's differs between source and destination, I get the following error. Any suggestions?

node = replacements.pop()

IndexError: pop from empty list

import os, glob, copy
from lxml import etree
path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    filepaths = glob.glob(os.path.join(path, fn, "*overall.xml"))
    for filepath in filepaths:
        (pa, filename) = os.path.split(filepath)
        # parse the source (correct attr's) and the target metadata
        source = etree.parse(os.path.join(pa, "attributes.xml")).getroot()
        dest = etree.parse(os.path.join(pa, "base_metadata_overall.xml")).getroot()

        replacements = source.xpath('//attr')
        replacements.reverse()

        for attr in dest.xpath('//attr'):
            # raises IndexError when dest has more attr's than source
            node = replacements.pop()
            attr.getparent().replace(attr, copy.deepcopy(node))
        #print(etree.tostring(dest))
        tree = etree.ElementTree(dest)
        tree.write(os.path.join(pa, "edited_metadata.xml"))
        print(fn + "--- successfully edited")
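
The IndexError comes from popping one replacement per destination attr: once the destination has more attr's than the source, the list runs empty. One possible guard, sketched here with the standard-library xml.etree.ElementTree rather than lxml (so a child-to-parent map stands in for getparent()), is to pair the two lists with zip and report what is left over. The helper name replace_pairwise and the sample strings are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def replace_pairwise(source_root, dest_root, tag="attr"):
    """Replace dest <tag> elements with source ones, in document order.

    Stops at the shorter list instead of raising, and returns
    (replaced, unused_source_count, untouched_dest_count).
    """
    replacements = list(source_root.iter(tag))
    # ElementTree has no getparent(), so map each child to its parent.
    parent_of = {child: parent for parent in dest_root.iter() for child in parent}
    targets = [el for el in dest_root.iter(tag) if el in parent_of]
    for old, new in zip(targets, replacements):
        parent = parent_of[old]
        index = list(parent).index(old)
        parent[index] = copy.deepcopy(new)
    replaced = min(len(targets), len(replacements))
    return replaced, len(replacements) - replaced, len(targets) - replaced

if __name__ == "__main__":
    source = ET.fromstring('<root><attr><chaos/></attr></root>')
    dest = ET.fromstring('<tree><attr><one/></attr><attr><two/></attr></tree>')
    print(replace_pairwise(source, dest))
    print(ET.tostring(dest, encoding="unicode"))
```

This only avoids the crash; whether leftover destination attr's should be kept or removed depends on the metadata, so the counts are returned for the caller to decide.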

Update 5/16/2011: I restructured a few things to fix the "IndexError: pop from empty list" error mentioned above. I realized that replacing the "attr" tags will not always be a 1-to-1 replacement. For example, sometimes the source .xml has 20 attr's and the destination .xml has 25 attr's. In this case, the 1-to-1 replacement would choke.

Anyway, the code below removes all attr's, then inserts the source attr's in their place. It also checks for another tag, "subtype"; if any exist, it adds them after the attr's, but inside the "detailed" tags.

Thanks again to everyone who helped.

import os, glob, copy
from lxml import etree
path = r"G:\63D RRC GIS Data\metadata\general\2010_contract"
#path = r"C:\temp\python\xml"
for fn in os.listdir(path):
    correct_title = fn.replace('_', ' ') + " various facilities"
    correct_fc_name = fn.replace('_', ' ')
    filepaths = glob.glob(os.path.join(path, fn, "*overall.xml"))
    for filepath in filepaths:
        print("-----" + fn + "-----")
        (pa, filename) = os.path.split(filepath)
        source = etree.parse(os.path.join(pa, "attributes.xml")).getroot()
        dest = etree.parse(os.path.join(pa, "base_metadata_overall.xml")).getroot()
        replacements = source.xpath('//attr')
        replacesubtypes = source.xpath('//subtype')

        # remove every existing attr from the destination
        attrtag = dest.xpath('//attr')
        for n in attrtag:
            n.getparent().remove(n)
        print(str(len(attrtag)) + " attr's removed")

        for n2 in dest.xpath('//detailed'):
            # insert after the first child of <detailed>, keeping source order;
            # deepcopy so the same nodes can go into more than one <detailed>
            pos = 1
            for realatrs in replacements:
                n2.insert(pos, copy.deepcopy(realatrs))
                pos += 1
            print("attr's replaced")
            if replacesubtypes:
                for realsubtypes in replacesubtypes:
                    n2.insert(pos, copy.deepcopy(realsubtypes))
                    pos += 1
                print("subtype's replaced")

        tree = etree.ElementTree(dest)
        tree.write(os.path.join(pa, "base_metadata_overall_v2.xml"))
        print(fn + "--- successfully edited")
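
The remove-then-insert idea can be tried on small inline documents first. The following is a rough sketch using the standard-library xml.etree.ElementTree instead of lxml; the helper name swap_attrs, the sample tag contents, and the assumption that the first child of "detailed" should stay in front are all hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def swap_attrs(source_root, dest_root):
    """Replace all <attr> children of each <detailed> with the source ones.

    Source <attr> elements are inserted in document order after the first
    child of <detailed>, followed by any source <subtype> elements.
    """
    new_nodes = source_root.findall(".//attr") + source_root.findall(".//subtype")
    for detailed in dest_root.iter("detailed"):
        # Drop the existing attr children (iterate over a copy while removing).
        for child in list(detailed):
            if child.tag == "attr":
                detailed.remove(child)
        # Keep the first remaining child in front, if there is one.
        position = min(1, len(detailed))
        for node in new_nodes:
            # deepcopy so the source tree is left untouched and the same
            # node can be placed into several <detailed> elements.
            detailed.insert(position, copy.deepcopy(node))
            position += 1

if __name__ == "__main__":
    source = ET.fromstring(
        '<src><attr><attrlabl>A</attrlabl></attr>'
        '<attr><attrlabl>B</attrlabl></attr>'
        '<subtype><stname>S</stname></subtype></src>'
    )
    dest = ET.fromstring(
        '<metadata><detailed><enttyp/>'
        '<attr><attrlabl>old1</attrlabl></attr>'
        '<attr><attrlabl>old2</attrlabl></attr>'
        '<attr><attrlabl>old3</attrlabl></attr>'
        '</detailed></metadata>'
    )
    swap_attrs(source, dest)
    print(ET.tostring(dest, encoding="unicode"))
```

Because the destination attr's are removed before anything is inserted, the number of attr's on each side no longer has to match.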


Recommended answer

Here is an example of using lxml to do this. I'm not exactly sure how you want the <attr/> nodes replaced, but this example should provide a pattern you can reuse.

Update: I changed it to replace each <attr> in tree2 with the corresponding node from tree1, in document order:

import copy
import lxml.etree

xml1 = '''<root><attr><chaos foo="0"/></attr><attr><arena foo="1"/></attr></root>'''
xml2 = '''<tree><attr><one/></attr><attr><two/></attr></tree>'''
tree1 = lxml.etree.fromstring(xml1)
tree2 = lxml.etree.fromstring(xml2)

# select <attr/> nodes from tree1, will be used to replace corresponding
# nodes in tree2
replacements = tree1.xpath('//attr')
replacements.reverse()

for attr in tree2.xpath('//attr'):
    # replace the attr node in tree2 with 'replacement' from tree1
    node = replacements.pop()
    attr.getparent().replace(attr, copy.deepcopy(node))

print(lxml.etree.tostring(tree2).decode())

Result (whitespace added for readability):

<tree>
  <attr><chaos foo="0"/></attr>
  <attr><arena foo="1"/></attr>
</tree>
