Parse a string of multipart data


Question

I have a string (shown here after Base64 decoding) that looks like this:

----------------------------212550847697339237761929
Content-Disposition: form-data; name="preferred_name"; filename="file1.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE1}
----------------------------212550847697339237761929
Content-Disposition: form-data; name="to_process"; filename="file2.rtf"
Content-Type: application/rtf

{\rtf1\ansi\ansicpg1252\cocoartf1504\cocoasubrtf830
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 testing123FILE212341234}
----------------------------212550847697339237761929--

I generate this on a simple webpage that uploads a couple of files to an AWS Lambda script through a PUT request via API Gateway. Note that what I get from API Gateway is a Base64 string, which I then decode into the string above.

The string above is the data that my Lambda script receives from API Gateway. I would like to parse this string with Python 2.7 to retrieve the data it contains. I've experimented with the cgi module and its cgi.parse_multipart() method, but I cannot find a way to convert a string into the arguments it requires. Any tips?

Recommended answer

Comment: Is it robust and spec-compliant?

Yes, as long as your data meets these preconditions:

  • The first line is the boundary
  • The headers that follow are terminated by an empty line
  • Each message part is terminated by the boundary
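
The three preconditions above can be checked before parsing. The following is only a sketch: check_preconditions is a hypothetical helper name, and it assumes the data is a str whose lines end in "\n" or "\r\n".

```python
def check_preconditions(data):
    """Return True if data meets the three preconditions listed above."""
    lines = [line.rstrip('\r') for line in data.split('\n')]
    # Ignore empty lines after the closing boundary
    while lines and lines[-1] == '':
        lines.pop()

    # 1. The first line is the boundary (multipart delimiters start with "--")
    if not lines or not lines[0].startswith('--'):
        return False
    boundary = lines[0]

    # 2. The headers after the boundary end with an empty line
    for line in lines[1:]:
        if line == '':
            break
        if line.startswith(boundary):
            return False  # next boundary reached before any empty line
    else:
        return False  # no empty line at all

    # 3. The last part is terminated by the closing boundary ("<boundary>--")
    return lines[-1] == boundary + '--'
```

If the check fails, the line-based parser shown further down is unlikely to produce useful results, so falling back to a library-based parser may be the safer choice.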

Comment: What if the content is binary, like a JPEG stream?

This is likely to break, because string methods are used and the content is read with .readline(), which depends on newlines.
Decoding from Base64 and then unpacking the multipart data this way is therefore the wrong approach for binary content!
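
One way to avoid the newline dependence is to keep the Base64-decoded data as bytes and split it on the delimiter instead of reading it line by line. The sketch below is an illustration, not part of the original answer: split_multipart is a hypothetical helper, it assumes the first line is the delimiter, that the delimiter never occurs inside a part, and that filenames are UTF-8.

```python
import base64
import re

def split_multipart(raw):
    """Split a multipart body given as bytes; part content stays bytes."""
    first_line = raw.split(b'\n', 1)[0]
    # Detect the line ending from the first line (CRLF or LF)
    eol = b'\r\n' if first_line.endswith(b'\r') else b'\n'
    delimiter = first_line.rstrip(b'\r')

    parts = []
    # Prefix a line ending so every delimiter, including the first, follows one
    for chunk in (eol + raw).split(eol + delimiter)[1:]:
        if chunk.startswith(b'--'):
            break  # closing delimiter reached
        # Headers and content are separated by the first empty line
        head, _, body = chunk.partition(eol + eol)
        match = re.search(b'filename="([^"]*)"', head)
        filename = match.group(1).decode('utf-8') if match else None
        parts.append({'filename': filename, 'content': body})
    return parts

# Made-up sample with binary content that contains a newline byte
sample = (b'--X\r\n'
          b'Content-Disposition: form-data; name="img"; filename="a.bin"\r\n'
          b'Content-Type: application/octet-stream\r\n'
          b'\r\n'
          b'\x00\x01\n\xff\r\n'
          b'--X--\r\n')
encoded = base64.b64encode(sample)  # stands in for the Base64 string received
parts = split_multipart(base64.b64decode(encoded))
print(parts)
```

Because the content is never decoded to str and never passed through .readline(), a newline byte inside a part is kept as ordinary data.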

Comment: Is there a version that reuses a common library?

If you are able to provide your data as a standard MIME message, you can use the following:

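import email
# mimeHeader is assumed to hold the MIME headers (including the Content-Type
# line that names the boundary) followed by an empty line; data is the string above
msg = email.message_from_string(mimeHeader+data)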
print('is_multipart:{}'.format(msg.is_multipart()))

for part in msg.walk():
    if part.get_content_maintype() == 'multipart':
        continue

    filename = part.get_filename()
    payload = part.get_payload(decode=True)
    print('{} filename:{}\n{}'.format(part.get_content_type(), filename, payload))

Output:

is_multipart:True
application/rtf filename:file1.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
application/rtf filename:file2.rtf
b'{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n{\x0conttbl\x0c0\x0cswiss\x0ccharset0'... (omitted for brevity)
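
The value of mimeHeader is not shown in the snippet above. A minimal sketch of one way to build it, assuming the boundary can be read from the first line of the data (the delimiter minus its two leading dashes). parse_form_data is a hypothetical helper name; in a real request the boundary is better taken from the Content-Type request header. The sketch uses email.message_from_bytes so the data does not have to be decoded to str first, though binary payloads would still need testing.

```python
import email

def parse_form_data(raw):
    """Parse a multipart/form-data body (bytes) with the standard email package."""
    # The first line is the delimiter: two dashes followed by the boundary
    first_line = raw.split(b'\n', 1)[0].strip()
    boundary = first_line[2:]
    # Assumed header block; it ends with the empty line that separates
    # the headers from the body
    mime_header = (b'MIME-Version: 1.0\n'
                   b'Content-Type: multipart/form-data; boundary="'
                   + boundary + b'"\n\n')
    msg = email.message_from_bytes(mime_header + raw)
    # Skip the multipart container itself and collect (filename, payload) pairs
    return [(part.get_filename(), part.get_payload(decode=True))
            for part in msg.walk()
            if part.get_content_maintype() != 'multipart']
```

Each returned pair holds the filename from the Content-Disposition header and the payload as bytes.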


Question: Parse a string of multipart data

A pure Python solution, for instance:

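import io
import re

# data is the decoded multipart string shown in the question
with io.StringIO(data) as fh:
    parts = []
    part_line = []
    part_fname = None
    new_part = None  # the boundary, taken from the first line
    robj = re.compile(r'.+filename="(.+)"')

    while True:
        line = fh.readline()
        if not line:
            break

        if not new_part:
            new_part = line[:-1]  # first line without its trailing newline

        if line.startswith(new_part):
            # A boundary closes the previous part, if one was collected
            if part_line:
                parts.append({'filename': part_fname, 'content': ''.join(part_line)})
                part_line = []

            # Read the headers up to the empty line, picking up the filename
            while line and line != '\n':
                _match = robj.match(line)
                if _match:
                    part_fname = _match.group(1)
                line = fh.readline()
        else:
            part_line.append(line)

for part in parts:
    print(part)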

Output:

{'filename': 'file1.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)
{'filename': 'file2.rtf', 'content': '{\rtf1\x07nsi\x07nsicpg1252\\cocoartf1504\\cocoasubrtf830\n... (omitted for brevity)

Tested with Python 3.4.2
