Using Python ElementTree's iterparse function and writing the modified tree to an output file

Question

I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new XML file. I've been trying to use iterparse from Python's ElementTree, but I'm confused about how to modify the tree and then write the resulting tree to a new XML file. I've read the documentation on iterparse, but it hasn't cleared things up. Is there a simple way to do this?

Thanks!

Edit: Here's what I have so far.

import xml.etree.ElementTree as ET
import re

date_pages = []
f = open('dates_texts.xml', 'w+')

tree = ET.iterparse("sample.xml")

for event, element in tree:
    if element.tag == 'page':
        for page_element in element:
            if page_element.tag == 'revision':
                for revision_element in page_element:
                    if revision_element.tag == 'text':
                        # clear pages whose revision text contains no 20xx year
                        if not re.search(r'20\d\d', revision_element.text or ''):
                            element.clear()


Recommended answer

If you have a large XML file that doesn't fit in memory, you could try to serialize it one element at a time. For example, assuming a <root><page/><page/><page/>...</root> document structure and ignoring possible namespace issues:

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(filename_or_file, tag):
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

with open('output.xml', 'wb') as file:
    # start root
    file.write(b'<root>')

    for page in getelements('sample.xml', 'page'):
        if keep(page):
            file.write(etree.tostring(page, encoding='utf-8'))

    # close root
    file.write(b'</root>')

Here keep(page) is a function that returns True if the page should be kept, e.g.:

import re

def keep(page):
    # every <revision> element's text must contain a year of the form 20xx
    return all(re.search(r'20\d\d', rev.text or '')
               for rev in page.iterfind('revision'))
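
To try the streaming approach end to end without a 40GB file, here is a minimal sketch that runs on in-memory data. It assumes the page/revision/text nesting from the question; the names filter_pages and has_year and the sample document are made up for illustration and are not part of the original answer.

```python
import io
import re
import xml.etree.ElementTree as etree

def getelements(source, tag):
    """Yield each completed <tag> element, then release parsed children."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # drop already-processed children from the root

def filter_pages(source, dest, keep):
    """Write only the <page> elements for which keep(page) is true."""
    dest.write(b'<root>')
    for page in getelements(source, 'page'):
        if keep(page):
            dest.write(etree.tostring(page, encoding='utf-8'))
    dest.write(b'</root>')

def has_year(page):
    # hypothetical predicate: some <revision>/<text> mentions a 20xx year
    return any(re.search(r'20\d\d', t.text or '')
               for t in page.iterfind('revision/text'))

# hypothetical sample input standing in for the large file
sample = io.BytesIO(
    b'<root>'
    b'<page><revision><text>written in 2012</text></revision></page>'
    b'<page><revision><text>no date here</text></revision></page>'
    b'</root>')
out = io.BytesIO()
filter_pages(sample, out, has_year)
result = out.getvalue()
```

Swapping the BytesIO objects for real file paths or file objects opened in binary mode should give the same behaviour on disk. Note that the root's own attributes and any namespace declarations are not copied by this sketch.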

For comparison, to modify a small XML file, you could:

# parse small xml
tree = etree.parse('sample.xml')

# remove some root/page elements from xml
root = tree.getroot()
for page in root.findall('page'):
    if not keep(page):
        root.remove(page) # modify inplace

# write to a file modified xml tree
tree.write('output.xml', encoding='utf-8')
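
One detail worth noting in the small-file version: findall() returns a list, so removing children from root inside the loop does not disturb the iteration. A minimal in-memory sketch of that pattern follows; the id attribute and the odd-id rule in keep_odd are purely hypothetical.

```python
import io
import xml.etree.ElementTree as etree

# hypothetical small document
root = etree.fromstring('<root><page id="1"/><page id="2"/><page id="3"/></root>')
tree = etree.ElementTree(root)

def keep_odd(page):
    # hypothetical rule: keep pages whose id is odd
    return int(page.get('id')) % 2 == 1

# findall() builds a list up front, so root.remove() is safe inside the loop
for page in root.findall('page'):
    if not keep_odd(page):
        root.remove(page)

buf = io.BytesIO()
tree.write(buf, encoding='utf-8')  # same call as for a file path
remaining = [p.get('id') for p in root.findall('page')]
```

Iterating over root directly (for page in root:) while removing would risk skipping elements, which is why the list returned by findall() is used here.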
