Python replace XML content with Etree


Question

I'd like to parse and compare 2 XML files with the Python Etree parser as follows:

I have 2 XML files with loads of data. One is in English (the source file), the other is the corresponding French translation (the target file). E.g.:

Source file:

<AB>
  <CD/>
  <EF>

    <GH>
      <id>123</id>
      <IJ>xyz</IJ>
      <KL>DOG</KL>
      <MN>dogs/dog</MN>
      some more tags and info on same level
      <metadata>
        <entry>
           <cl>Translation</cl>
           <cl>English:dog/dogs</cl>
        </entry>
        <entry>
           <string>blabla</string>
           <string>blabla</string>
        </entry>
            some more strings and entries
      </metadata>
    </GH>

  </EF>
  <stuff/>
  <morestuff/>
  <otherstuff/>
  <stuffstuff/>
  <blubb/>
  <bla/>
  <blubbla>8</blubbla>
</AB>

The target file looks exactly the same, but has no text in some places:

<MN>chiens/chien</MN>
some more tags and info on same level
<metadata>
  <entry>
    <cl>Translation</cl>
    <cl></cl>
  </entry>

The French target file has an empty cross-language reference, which I'd like to fill with the information from the English source file whenever the 2 macros have the same ID. I already wrote some code in which I replaced the string tag name with a unique tag name in order to identify the cross-language reference. Now I want to compare the 2 files and, if 2 macros have the same ID, replace the empty reference in the French file with the info from the English file. I was trying out the minidom parser before but got stuck and would like to try Etree now. I have hardly any knowledge of programming and find this very hard. Here is the code I have so far:

    macros = ElementTree.parse(english)

    for tag in macros.getchildren('macro'):
        id_ = tag.find('id')
        data = tag.find('cl')
        id_dict[id_.text] = data.text

    macros = ElementTree.parse(french)

    for tag in macros.getchildren('macro'):
        id_ = tag.find('id')
        target = tag.find('cl')
        if target.text.strip() == '':
            target.text = id_dict[id_.text]

    print (ElementTree.tostring(macros))

I am more than clueless and reading other posts on this confuses me even more. I'd appreciate it very much if someone could enlighten me :-)

Recommended answer

There are probably more details to be clarified. Here is a sample with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you want to go only one level below the root:

import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

# Get the root elements, as they support iteration
# through their children (direct descendants)
english_root = english_tree.getroot()
french_root = french_tree.getroot()

# Iterate through the direct descendants of the root
# elements in both trees in parallel.
for en, fr in zip(english_root, french_root):
    assert en.tag == fr.tag  # check for the same structure
    if en.tag == 'id':
        assert en.text == fr.text  # check for the same id

    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display what was replaced

etree.dump(french_tree)
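
The test `fr.text is None` matters because ElementTree reports the text of an empty element as None rather than as an empty string, so calling `.strip()` on it directly raises an AttributeError. A minimal sketch of this, using a made-up in-memory fragment modelled on the French file (the helper name `is_blank` is hypothetical, not part of ElementTree):

```python
import xml.etree.ElementTree as etree

# Hypothetical fragment: one filled <cl>, two empty ones, one with spaces only.
entry = etree.fromstring(
    "<entry><cl>Translation</cl><cl></cl><cl/><cl>  </cl></entry>"
)

def is_blank(elem):
    """Return True when the element has no text or only whitespace."""
    return elem.text is None or elem.text.strip() == ""

texts = [cl.text for cl in entry]
blank_flags = [is_blank(cl) for cl in entry]
print(texts)
print(blank_flags)
```

A helper like this could replace the bare `is None` test if the target file may also contain whitespace-only elements.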

For more complex file structures, the loop through the direct children of a node can be replaced by iteration through all the elements of the tree. If the structures of the files are exactly the same, the following code will work:

import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for en, fr in zip(english_tree.iter(), french_tree.iter()):
    assert en.tag == fr.tag        # check if the structure is the same
    if en.tag == 'id':
        assert en.text == fr.text  # identification must be the same
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)         # display the inserted text

# Write the result to the output file. With encoding='unicode',
# tostring() returns str instead of bytes, so it can go to a text file.
with open('fr2.xml', 'w', encoding='utf-8') as fout:
    fout.write(etree.tostring(french_tree.getroot(), encoding='unicode'))
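
One caveat with the pairwise approach: zip() stops silently at the end of the shorter sequence, so a per-pair assert cannot notice extra trailing elements in one file. The following is only a sketch of a stricter variant that compares the full tag sequences first. The miniature documents and the function names `same_structure` and `fill_parallel` are made up for illustration; real files would be loaded with etree.parse().

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature documents standing in for en.xml and fr.xml.
EN = "<AB><GH><id>1</id><string>dog</string></GH></AB>"
FR = "<AB><GH><id>1</id><string></string></GH></AB>"
FR_EXTRA = "<AB><GH><id>1</id><note/><string></string></GH></AB>"

def same_structure(a, b):
    """Return True when both trees yield the same tags in document order."""
    tags_a = [elem.tag for elem in a.iter()]
    tags_b = [elem.tag for elem in b.iter()]
    return tags_a == tags_b

def fill_parallel(en_root, fr_root):
    """Copy English text into empty French <string> elements, pairwise."""
    if not same_structure(en_root, fr_root):
        raise ValueError("trees differ in structure; pairwise copy is unsafe")
    for en, fr in zip(en_root.iter(), fr_root.iter()):
        if en.tag == "string" and fr.text is None:
            fr.text = en.text
    return fr_root

filled = fill_parallel(etree.fromstring(EN), etree.fromstring(FR))
result = etree.tostring(filled, encoding="unicode")
print(result)

# A French tree with an extra element is rejected before anything is copied.
try:
    fill_parallel(etree.fromstring(EN), etree.fromstring(FR_EXTRA))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```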

However, this works only when both files have exactly the same structure. Let's follow the algorithm you would use to do the task manually. First, find the French translation that is empty. Then replace it with the English translation from the GH element with the same identification. The elements are located with the subset of XPath expressions that ElementTree supports:

import xml.etree.ElementTree as etree

def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned.
    for gh in tree.iter('GH'):
        id_elem = gh.find('./id')
        if id_ == id_elem.text:
            # The related GH element found.
            # Find metadata entry, extract the translation.
            # Warning! This is a simplification that relies on the fixed
            # position of the Translation entry.
            me = gh.find('./metadata/entry')
            assert len(me) == 2     # metadata/entry has two elements
            cl1 = me[0]
            assert cl1.text == 'Translation'
            cl2 = me[1]

            return cl2.text
    return None


# Body of the program. --------------------------------------------------

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for gh in french_tree.iter('GH'):  # iterate through the GH elements only
    # Get the identification of the GH section.
    id_elem = gh.find('./id')
    id_ = id_elem.text

    # Find and check the metadata entry that holds the French translation.
    # Warning! This is a simplification that relies on the fixed position
    # of the Translation entry.
    me = gh.find('./metadata/entry')
    assert len(me) == 2     # metadata/entry has two elements
    cl1 = me[0]
    assert cl1.text == 'Translation'
    cl2 = me[1]

    # If the French translation is empty, put there the English translation
    # from the related element.
    if cl2.text is None:
        cl2.text = find_translation(english_tree, id_)


with open('fr2.xml', 'w') as fout:
   fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
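
For large files, calling find_translation() for every empty entry rescans the English tree each time. A possible alternative, closer to the id_dict idea in the question, is to build a dictionary from id to translation once and look each id up. This is only a sketch under assumptions: the two miniature documents are invented, the helper names `translation_cl` and `build_index` are hypothetical, and the Translation entry is located by its first `<cl>` text instead of by fixed position.

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature versions of en.xml and fr.xml (GH order differs).
EN = """<AB><EF>
  <GH><id>123</id><metadata>
    <entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry>
  </metadata></GH>
  <GH><id>456</id><metadata>
    <entry><cl>Translation</cl><cl>English:cat/cats</cl></entry>
  </metadata></GH>
</EF></AB>"""
FR = """<AB><EF>
  <GH><id>456</id><metadata>
    <entry><cl>Translation</cl><cl></cl></entry>
  </metadata></GH>
  <GH><id>123</id><metadata>
    <entry><cl>Translation</cl><cl>already filled</cl></entry>
  </metadata></GH>
</EF></AB>"""

def translation_cl(gh):
    """Return the <cl> that follows <cl>Translation</cl> in gh, or None."""
    for entry in gh.iterfind("./metadata/entry"):
        cls = entry.findall("cl")
        if len(cls) == 2 and cls[0].text == "Translation":
            return cls[1]
    return None

def build_index(root):
    """Map each GH id to its translation text in one pass over the tree."""
    index = {}
    for gh in root.iter("GH"):
        cl = translation_cl(gh)
        if cl is not None:
            index[gh.findtext("id")] = cl.text
    return index

en_root = etree.fromstring(EN)
fr_root = etree.fromstring(FR)
index = build_index(en_root)

for gh in fr_root.iter("GH"):
    cl = translation_cl(gh)
    # Fill only empty references; keep the old value if the id is unknown.
    if cl is not None and not (cl.text or "").strip():
        cl.text = index.get(gh.findtext("id"), cl.text)

filled = {gh.findtext("id"): translation_cl(gh).text for gh in fr_root.iter("GH")}
print(filled)
```

Because the lookup goes through the id, this variant does not require the GH elements to appear in the same order in both files.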
