Decoding RFC 2231 headers

Problem description

Trying to address this issue, I'm trying to wrap my head around the various functions in the Python standard library aimed at supporting RFC 2231. The main aim of that RFC appears to be three-fold: allowing non-ASCII encoding in header parameters, noting the language of a given value, and allowing header parameters to span multiple lines. The email.utils module provides several functions to deal with various aspects of this. As far as I can tell, they work as follows:

decode_rfc2231 only splits the value of such a parameter into its parts, like this:

>>> import email.utils
>>> email.utils.decode_rfc2231("utf-8''T%C3%A4st.txt")
['utf-8', '', 'T%C3%A4st.txt']
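As a small sketch of the same call on two other inputs (both input strings are made up for illustration): a value that carries a language tag between the two ticks, and a value that has no ticks at all. In the second case the function has nothing to split, so charset and language come back as None.

```python
import email.utils

# A value with an explicit language tag between the two ticks.
with_lang = email.utils.decode_rfc2231("utf-8'en'T%C3%A4st.txt")

# A value without any ticks is returned unsplit, with charset and
# language both set to None.
plain = email.utils.decode_rfc2231("plain value")

print(list(with_lang))
print(tuple(plain))
```

Note that the percent-encoded part is left untouched in both cases; decode_rfc2231 only splits.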

decode_params takes care of detecting RFC 2231-encoded parameters. It collects the parts that belong together and decodes the URL-encoded string to a byte sequence. This byte sequence, however, is then represented as a latin1 string, and all values are enclosed in quotation marks. Furthermore, the first argument is handled specially: it still has to be a tuple of two elements, but those two are passed to the result without modification.

>>> email.utils.decode_params([
...   (1,2),
...   ("foo","bar"),
...   ("name*","utf-8''T%C3%A4st.txt"),
...   ("baz*0","two"),("baz*1","-part")])
[(1, 2), ('foo', '"bar"'), ('baz', '"two-part"'), ('name', ('utf-8', '', '"Täst.txt"'))]

collapse_rfc2231_value can be used to convert this triple of encoding, language and byte sequence into a proper unicode string. What has me confused, though, is the fact that if the input was such a triple, then the quotes will be carried over to the output. If, on the other hand, the input was a single quoted string, then these quotes will be removed.

>>> [(k, email.utils.collapse_rfc2231_value(v)) for k, v in
...  email.utils.decode_params([
...   (1,2),
...   ("foo","bar"),
...   ("name*","utf-8''T%C3%A4st.txt"),
...   ("baz*0","two"),("baz*1","-part")])[1:]]
[('foo', 'bar'), ('baz', 'two-part'), ('name', '"Täst.txt"')]

So it seems that in order to use all this machinery, I'd have to add yet another step to unquote the third element of any tuple I'd encounter. Is this true, or am I missing some point here? I had to figure out a lot of the above with help from the source code, since the docs are a bit vague on the details. I cannot imagine what could be the point behind this selective unquoting. Is there a point to it?
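The extra unquoting step described above could be sketched roughly as follows. The helper name decode_and_collapse is hypothetical and not part of the standard library, and the first list entry is only a placeholder for the element that decode_params passes through unmodified.

```python
import email.utils

def decode_and_collapse(params):
    """Hypothetical helper: decode RFC 2231 parameters to plain strings.

    Skips the first entry (passed through untouched by decode_params),
    strips the quotes from the third element of any triple, then
    collapses every value to a str.
    """
    result = []
    for name, value in email.utils.decode_params(params)[1:]:
        if isinstance(value, tuple):
            charset, language, text = value
            # The extra step: unquote the text part of the triple.
            value = (charset, language, email.utils.unquote(text))
        result.append((name, email.utils.collapse_rfc2231_value(value)))
    return result

params = [
    ("text/plain", ""),  # placeholder first entry, ignored here
    ("foo", "bar"),
    ("name*", "utf-8''T%C3%A4st.txt"),
    ("baz*0", "two"), ("baz*1", "-part"),
]
decoded = dict(decode_and_collapse(params))
print(decoded)
```

Collecting the result into a dict sidesteps the question of the order in which decode_params emits the continued parameters.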

What is the best reference on how to use these functions?

The best I found so far is the email.message.Message implementation. There, the process seems to be roughly the one outlined above, but every field gets unquoted via _unquotevalue after decode_params, and only get_filename and get_boundary collapse their values; all others return a tuple instead. I hope there is something more useful.
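A minimal sketch of that behaviour, assuming the default (compat32) policy and a made-up message whose filename parameter uses a language tag and a continuation: get_filename returns a collapsed string, while get_param hands back the unquoted triple, which then still has to be collapsed by hand.

```python
import email
from email.utils import collapse_rfc2231_value

# Made-up message; the filename is split into an encoded first part
# and a plain, quoted second part.
raw = (
    "Content-Type: text/plain\n"
    "Content-Disposition: attachment; "
    "filename*0*=utf-8'de'T%C3%A4st; filename*1=\".txt\"\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)

# get_filename collapses the value to a str.
filename = msg.get_filename()

# get_param unquotes but does not collapse: for an RFC 2231-encoded
# parameter it returns a (charset, language, text) tuple.
param = msg.get_param("filename", header="content-disposition")
collapsed = collapse_rfc2231_value(param)

print(filename, param[:2], collapsed)
```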

Recommended answer

Currently the functions from email.utils are rarely used outside of email.message. Most users seem to prefer using email.message.Message directly. There's even a somewhat old issue report on adding unit tests (which would certainly be usable as examples) to Python, though I'm not sure how it relates to email.utils.

A short example I found is this blogpost which, however, doesn't contain more than one sentence and a few SLOCs of information about RFC 2231 parsing. The author notes, however, that many MTAs use RFC 2047 instead. Depending on your use case, that might also be an issue.
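For comparison, a value encoded as an RFC 2047 encoded-word can be decoded with the email.header module. This is only a sketch; the encoded string is a made-up example of what such a filename might look like.

```python
from email.header import decode_header, make_header

# Made-up RFC 2047 encoded-word (Q encoding, UTF-8 charset).
encoded = "=?utf-8?Q?T=C3=A4st.txt?="

# decode_header yields (bytes, charset) pairs; make_header joins them
# into a Header object, which str() turns into a Unicode string.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```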

Judging from the few examples I could find, I assume your way of parsing using email.utils is the only way to go, even if the long list comprehension is somewhat ugly.

Given the lack of examples, it could be wise to write a new RFC 2231 parser (if you really need a better, maybe faster or more beautiful codebase). A new implementation could be based on existing ones such as the Dovecot RFC2231 parser for compatibility reasons (you could even use the Dovecot unit test). As the C code seems quite complex to me, and since I can't find any Python implementation besides email.utils and Python 2 backports of email.utils, the task of porting it to Python won't be easy. Note that Dovecot is LGPL-licensed, which might be an issue in your project.

I think the email.utils RFC 2231 API has not been designed for easy standalone usage, but rather as a pile of utility functions for use in email.message.Message.
