Python - email header decoding UTF-8
Is there any Python module that helps decode the various forms of encoded mail headers, mainly Subject, into plain (say, UTF-8) strings?
Here are example Subject headers from mail files that I have:
Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
Subject: [ 201105191633 ]
=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=
=?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=
text - encoded string - text
text - encoded string
text - encoded string - encoded string
The encoding could also be something else, such as ISO 8859-15.
Update 1: I forgot to mention, I tried email.header.decode_header
for item in message.items():
    if item[0] == 'Subject':
        sub = email.header.decode_header(item[1])
        logging.debug('Subject is %s' % sub)
This outputs
DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]
which does not really help.
Update 2: Thanks to Ingmar Hupp in the comments.
The second example decodes to a list of two tuples:
print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
[('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]
Is the result always [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or how else do I get it all into one string?
Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
does not decode well:
print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")
[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]
This type of encoding is known as MIME encoded-word, and the email module can decode it:
from email.header import decode_header
print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")
This outputs a list of tuples, each containing a decoded string and the encoding used. The result is a list because the format allows different encodings within a single header. To merge the pieces into a single string, convert them to a shared representation and then concatenate them, which can be done with Python's unicode object (Python 2):
from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
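The snippet above is Python 2 (print statement, unicode type). As a rough sketch of the same idea for Python 3, assuming that decode_header there returns bytes fragments for headers containing encoded-words and an unchanged str when nothing is encoded, something like the following could work. The helper name decode_mime_header and the ISO-8859-15 sample header are made up for illustration; an unknown charset name would still raise LookupError.

```python
from email.header import decode_header


def decode_mime_header(value, default_charset="ascii"):
    """Decode a raw header value into one str (hypothetical helper)."""
    pieces = []
    for fragment, charset in decode_header(value):
        if isinstance(fragment, bytes):
            # Fragments of a header with encoded-words arrive as bytes;
            # charset is None for the plain-text parts.
            pieces.append(fragment.decode(charset or default_charset, errors="replace"))
        else:
            # A header without any encoded-word comes back as an unchanged str.
            pieces.append(fragment)
    return "".join(pieces)


subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
print(decode_mime_header(subject))
# [ 201105161048 ] GewSt: Wegfall der Vorläufigkeit

# Made-up sample in another charset (0xA4 is the euro sign in ISO 8859-15).
print(decode_mime_header("=?ISO-8859-15?Q?Preis_100_=A4?="))
# Preis 100 €
```

Decoding each fragment with its own charset before joining is what makes mixed encodings in one header work.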
Update 2:
The problem with this Subject line not decoding:
Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^
is actually the sender's fault: the header violates the requirement that an encoded-word be separated from its neighbours by whitespace, as specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.
If need be, you can work around this by pre-processing these malformed headers with a regex that inserts a space after the encoded-word part (unless it is at the end), like so:
import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
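Note that `.*` in the one-line regex above is greedy, so with several encoded-words in one header it only adds a space after the last one that is followed by more text. A slightly stricter variant, offered only as a sketch, matches one encoded-word at a time and adds a space only where a non-whitespace character follows directly. The pattern and the function name space_after_encoded_words are my own assumptions, not part of the original answer.

```python
import re

# One encoded-word (=?charset?encoding?text?=) followed directly by a
# non-whitespace character. Neither charset nor encoded text may contain
# "?" or whitespace, so each encoded-word is matched on its own.
ENCODED_WORD_THEN_TEXT = re.compile(r"(=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=)(?=\S)")


def space_after_encoded_words(header_value):
    """Insert a space after each encoded-word that runs into following text."""
    return ENCODED_WORD_THEN_TEXT.sub(r"\1 ", header_value)


broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
print(space_after_encoded_words(broken))
# [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?= . Januar 2011
```

Depending on how the decoder treats whitespace around encoded-words, the inserted space may remain in the decoded text.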