Python - email header decoding UTF-8


Question

Is there a Python module that helps decode the various forms of encoded mail headers, mainly Subject, into plain (say, UTF-8) strings?

Here are example Subject headers from mail files that I have:

Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
Subject: [ 201105191633 ]
  =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=
  =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=

text - encoded string - text

text - encoded string

text - encoded string - encoded string

The encoding could also be something else, such as ISO 8859-15.

Update 1: I forgot to mention, I tried email.header.decode_header

    for item in message.items():
        if item[0] == 'Subject':
            sub = email.header.decode_header(item[1])
            logging.debug('Subject is %s' % sub)

This outputs

DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

which does not really help.

Update 2: Thanks to Ingmar Hupp in the comments.

The first example decodes to a list of two tuples:

print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
[('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]

Is the result always of the form [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or is there another way to get it all into one string?

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011

does not decode well:

print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")

[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

Solution

This type of encoding is known as MIME encoded-word and the email module can decode it:

from email.header import decode_header
print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")

This outputs a list of tuples, each containing a decoded string and the encoding used. This is because the format supports different encodings in a single header. To merge these into a single string, you need to convert them to a shared encoding and then concatenate them, which can be accomplished using Python's unicode object:

from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
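
The snippet above is Python 2 (print statement, unicode). A minimal sketch of the same idea for Python 3 follows, assuming that decode_header returns bytes chunks when the header contains encoded-words and a plain str otherwise. The helper name decode_mime_header and its default_charset parameter are made up for illustration.

```python
from email.header import decode_header


def decode_mime_header(value, default_charset="ascii"):
    """Decode a raw header value into one str (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        # Chunks are bytes when the header contained encoded-words and
        # str when nothing had to be decoded.
        if isinstance(chunk, bytes):
            # Unencoded chunks carry charset None, so fall back to the default.
            chunk = chunk.decode(charset or default_charset, errors="replace")
        parts.append(chunk)
    return "".join(parts)


subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
decoded = decode_mime_header(subject)
```

Because every chunk is decoded with its own charset before joining, headers that mix, say, UTF-8 and ISO 8859-15 encoded-words end up in one str. An unknown charset name would still raise LookupError here; handling that is left out of the sketch.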

Update 2:

The problem with this Subject line not decoding:

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^

is actually the sender's fault: the header violates the requirement that encoded-words in a header be separated by white-space, specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.
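
To find out which headers break this rule, one could scan a header value for encoded-words that are not surrounded by whitespace. The following is only a rough sketch: the regex is a simplified approximation of the RFC 2047 encoded-word syntax, and the function name unseparated_encoded_words is hypothetical.

```python
import re

# Simplified encoded-word shape: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def unseparated_encoded_words(value):
    """Return the encoded-words in value that touch neighbouring text."""
    offenders = []
    for match in ENCODED_WORD.finditer(value):
        # Treat the start and end of the value as if they were whitespace.
        before = value[match.start() - 1] if match.start() > 0 else " "
        after = value[match.end()] if match.end() < len(value) else " "
        if not (before.isspace() and after.isspace()):
            offenders.append(match.group(0))
    return offenders


broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
offenders = unseparated_encoded_words(broken)
```

This check is strict on both sides of the encoded-word, so it also flags headers that lack only the leading whitespace.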

If need be, you can work around this by pre-processing these corrupt headers with a regex that inserts a space after the encoded-word part (unless it's at the end), like so:

import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
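
As an illustration, the workaround can be wrapped in a small function and applied to the broken Subject from the question. This is a sketch with the regex backslashes written out; the function name add_space_after_encoded_word is made up.

```python
import re

# Greedy match from "=?" up to the last "?=" that is not at the very end
# of the value; the lookahead (?!$) skips an encoded-word that ends the header.
_UNSPACED = re.compile(r"(=\?.*\?=)(?!$)")


def add_space_after_encoded_word(header_value):
    """Insert a space after the encoded-word part (hypothetical helper)."""
    return _UNSPACED.sub(r"\1 ", header_value)


broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
fixed = add_space_after_encoded_word(broken)
# fixed now reads "...IDIx?= . Januar 2011", which satisfies the whitespace rule.
```

Two things to keep in mind: the inserted space stays in the header value, so it also shows up in the decoded text, and because the match is greedy, a value with several encoded-words only gets a space after the last one that is followed by more text.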
