Fully streaming XML parser


Question

I'm trying to consume the Exchange GetAttachment webservice using requests, lxml and base64io. This service returns a base64-encoded file in a SOAP XML HTTP response. The file content is contained in a single line in a single XML element. GetAttachment is just an example, but the problem is more general.

I would like to stream the decoded file contents directly to disk without storing the entire contents of the attachment in-memory at any point, since an attachment could be several 100 MB.

I tried something like this:

import lxml.etree
import requests
from base64io import Base64IO
from gzip import GzipFile
from io import BytesIO

r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
with open('foo.txt', 'wb') as f:
    for action, elem in lxml.etree.iterparse(GzipFile(fileobj=r.raw)):
        if elem.tag == 't:Content':
            b64_encoder = Base64IO(BytesIO(elem.text))
            f.write(b64_encoder.read())

but lxml still stores a copy of the attachment as elem.text. Is there any way I can create a fully streaming XML parser that also streams the content of an element directly from the input stream?

Recommended answer

Don't use iterparse in this case. The iterparse() method can only issue element start and end events, so any text in an element is given to you when the closing XML tag has been found.
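
As a small illustration of that behaviour, the sketch below uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml (an assumption made only to keep the example dependency-free). The sample document and the `text_lengths` name are invented for the demonstration.

```python
import io
import xml.etree.ElementTree as ET

# A made-up document with 4000 characters of base64-looking text.
document = b"<root><Content>" + b"QUJD" * 1000 + b"</Content></root>"

text_lengths = {}
for event, elem in ET.iterparse(io.BytesIO(document), events=("start", "end")):
    if elem.tag == "Content":
        # At "start" the text may not have been read yet; at "end" the
        # complete text is held in memory as a single string on elem.text.
        text_lengths[event] = len(elem.text or "")
```

By the time the "end" event arrives, the whole element text is already in memory, which is exactly what the question wants to avoid.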

Instead, use a SAX parser interface. This is a general standard that XML parsing libraries use to pass parsed data on to a content handler. The ContentHandler.characters() callback is passed character data in chunks (assuming that the implementing XML library actually makes use of this possibility). This is a lower-level API than the ElementTree API, and the Python standard library already bundles the Expat parser to drive it.
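
To see the chunked delivery in action, here is a minimal sketch using only the standard library. The `ChunkRecorder` class, the `feed_in_pieces` helper and the sample document are hypothetical names invented for this illustration; feeding the parser in small pieces stands in for reading from a network stream.

```python
import xml.sax
from xml.sax import handler


class ChunkRecorder(handler.ContentHandler):
    """Hypothetical handler that records every characters() call."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, data):
        self.chunks.append(data)


def feed_in_pieces(document, piece_size):
    """Feed `document` (bytes) to an incremental SAX parser piece by piece."""
    recorder = ChunkRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(recorder)
    for start in range(0, len(document), piece_size):
        parser.feed(document[start:start + piece_size])
    parser.close()
    return recorder.chunks


sample = b"<root>" + b"QUJD" * 50 + b"</root>"
chunks = feed_in_pieces(sample, piece_size=16)
```

How many chunks arrive, and how large each one is, depends on the parser; the only thing the handler can rely on is that joining them in order gives back the element's text.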

So the flow becomes:

  • Wrap the incoming request stream in a GzipFile for easy decompression. Or, better still, set response.raw.decode_content = True and leave decompression to the requests library based on the content-encoding the server has set.
  • Pass the GzipFile instance or raw stream to the .parse() method of a parser created with xml.sax.make_parser(). The parser then proceeds to read from the stream in chunks. By using make_parser() you can first enable features such as namespace handling (which ensures your code doesn't break if Exchange decides to alter the short prefixes used for each namespace).
  • The content handler characters() method is called with chunks of character data; check for the correct element start event, so you know when to expect base64 data. You can decode that base64 data in chunks of (a multiple of) 4 characters at a time, and write it to a file. I'd not use base64io here, just do your own chunking.
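
The chunking described in the last step can be sketched in isolation as a small generator. The function name `decode_base64_chunks` and the sample pieces are hypothetical; the sketch assumes the base64 text contains no whitespace or line breaks (the question states the content sits on a single line), since stray whitespace would throw off the 4-character alignment.

```python
from base64 import b64decode


def decode_base64_chunks(chunks):
    """Yield decoded bytes for an iterable of base64 text chunks.

    Chunks may be split at arbitrary positions; only whole groups of
    4 characters are decoded, the rest is carried over to the next chunk.
    """
    remainder = ""
    for chunk in chunks:
        data = remainder + chunk
        usable = len(data) - len(data) % 4
        data, remainder = data[:usable], data[usable:]
        if data:
            yield b64decode(data)
    if remainder:
        # Leftover characters mean the input was truncated or malformed.
        raise ValueError("Incomplete Base64 data")


# "hello world" encoded as base64, split at awkward positions
pieces = ["aGVsb", "G8gd29", "ybGQ", "="]
decoded = b"".join(decode_base64_chunks(pieces))
```

Because base64 padding can only appear at the very end of the data, every complete group of 4 characters before that point decodes independently to 3 bytes.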

A simple content handler could be:

from xml.sax import handler
from base64 import b64decode

class AttachmentContentHandler(handler.ContentHandler):
    types_ns = 'http://schemas.microsoft.com/exchange/services/2006/types'

    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def startDocument(self):
        self._buffer = None
        self._file = None

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # we can expect base64 data next
            self._file = open(self.filename, 'wb')
            self._buffer = []

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # all attachment data received, close the file
            try:
                if self._buffer:
                    raise ValueError("Incomplete Base64 data")
            finally:
                self._file.close()
                self._file = self._buffer = None

    def characters(self, data):
        if self._buffer is None:
            return
        self._buffer.append(data)
        self._decode_buffer()

    def _decode_buffer(self):
        remainder = ''
        for data in self._buffer:
            available = len(remainder) + len(data)
            overflow = available % 4
            if remainder:
                data = (remainder + data)
                remainder = ''
            if overflow:
                remainder, data = data[-overflow:], data[:-overflow]
            if data:
                self._file.write(b64decode(data))
        self._buffer = [remainder] if remainder else []

and you'd use it like this:

import requests
from xml.sax import make_parser, handler

parser = make_parser()
parser.setFeature(handler.feature_namespaces, True)
parser.setContentHandler(AttachmentContentHandler('foo.txt'))

r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
r.raw.decode_content = True  # if content-encoding is used, decompress as we read
parser.parse(r.raw)

This will parse the input XML in chunks of up to 64KB (the default IncrementalParser buffer size), so attachment data is decoded in at most 48KB blocks of raw data.
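
The 48KB figure follows from base64 encoding every 3 bytes as 4 characters. As a quick check (the constant names are only for illustration):

```python
BUFFER_SIZE = 64 * 1024           # upper bound on base64 characters per read
MAX_DECODED = BUFFER_SIZE // 4 * 3  # 3 raw bytes for every 4 characters
```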

I'd probably extend the content handler to take a target directory and then look for <t:Name> elements to extract the filename, then use that to extract the data to the correct filename for each attachment found. You'd also want to verify that you are actually dealing with a GetAttachmentResponse document, and handle error responses.
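
A rough sketch of that extension might look like the following. The class name `DirectoryAttachmentHandler` and its attributes are hypothetical, and the code assumes that each attachment's <t:Name> element appears before its <t:Content> element in the response. It does not check for a GetAttachmentResponse document or handle error responses.

```python
import os
from base64 import b64decode
from xml.sax import handler

TYPES_NS = 'http://schemas.microsoft.com/exchange/services/2006/types'


class DirectoryAttachmentHandler(handler.ContentHandler):
    """Hypothetical handler: writes each attachment to target_dir/<Name>."""

    def __init__(self, target_dir):
        super().__init__()
        self.target_dir = target_dir
        self.written = []         # paths of the files opened so far
        self._name_parts = None   # text pieces of the current <t:Name>
        self._name = None         # last complete attachment name seen
        self._file = None
        self._pending = ''        # base64 characters not yet decoded

    def startElementNS(self, name, qname, attrs):
        if name == (TYPES_NS, 'Name'):
            self._name_parts = []
        elif name == (TYPES_NS, 'Content'):
            if not self._name:
                raise ValueError('Content element without a preceding Name')
            # basename() drops any directory components from the name
            path = os.path.join(self.target_dir, os.path.basename(self._name))
            self._file = open(path, 'wb')
            self._pending = ''
            self.written.append(path)

    def characters(self, data):
        if self._name_parts is not None:
            self._name_parts.append(data)
        elif self._file is not None:
            # decode only whole groups of 4 characters, keep the rest
            data = self._pending + data
            usable = len(data) - len(data) % 4
            self._pending = data[usable:]
            if usable:
                self._file.write(b64decode(data[:usable]))

    def endElementNS(self, name, qname):
        if name == (TYPES_NS, 'Name'):
            self._name = ''.join(self._name_parts).strip()
            self._name_parts = None
        elif name == (TYPES_NS, 'Content'):
            try:
                if self._pending:
                    raise ValueError('Incomplete Base64 data')
            finally:
                self._file.close()
                self._file = None
                self._name = None
```

It would be used the same way as the handler above: enable handler.feature_namespaces on the parser, pass DirectoryAttachmentHandler('some/target/dir') to setContentHandler(), and call parse() on the response stream.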
