Python - email header decoding UTF-8


Question


Is there any Python module that helps decode the various forms of encoded mail headers, mainly Subject, into plain (say, UTF-8) strings?

Here are example Subject headers from mail files that I have:

Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
Subject: [ 201105191633 ]
  =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=
  =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=

text - encoded string - text

text - encoded string

text - encoded string - encoded string

The encoding could also be something else, such as ISO 8859-15.

Update 1: I forgot to mention, I tried email.header.decode_header

    import email.header
    import logging
    # message: an already-parsed email message (assumed)
    for item in message.items():
        if item[0] == 'Subject':
            sub = email.header.decode_header(item[1])
            logging.debug('Subject is %s' % sub)

This outputs

DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

which does not really help.

Update 2: Thanks to Ingmar Hupp in the comments.

The first example decodes to a list of two tuples:

print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
[('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]

Is the result always of the form [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or is there another way to get it all in one string?

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011

This one does not decode well:

print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")

[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

Solution

This type of encoding is known as MIME encoded-word and the email module can decode it:

from email.header import decode_header
print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")

This outputs a list of tuples, each containing a decoded string and the encoding used. This is because the format supports different encodings in a single header. To merge these into a single string, convert them to a shared encoding and then concatenate them, which can be done with Python 2's unicode object:

from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
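
The snippet above relies on Python 2's unicode type. On Python 3, a rough equivalent might look like the sketch below. The names decode_mime_header and default_charset are hypothetical, not part of the email package, and the sketch assumes decode_header may return either bytes or str parts, so it handles both.

```python
from email.header import decode_header

def decode_mime_header(value, default_charset="ascii"):
    """Hypothetical helper: decode a raw header value into a single str."""
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Decode with the charset reported for this part, falling back
            # to the default when none is given. An unknown charset name
            # would still raise LookupError here.
            parts.append(text.decode(charset or default_charset, errors="replace"))
        else:
            # A value without any encoded-word comes back as str already.
            parts.append(text)
    return "".join(parts)

subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
print(decode_mime_header(subject))

# The same helper with a different charset (ISO 8859-15, Q encoding).
print(decode_mime_header("=?ISO-8859-15?Q?Gr=FC=DFe?="))
```

Because each part is decoded with its own charset before joining, headers that mix UTF-8 and ISO 8859-15 encoded-words should end up in one ordinary string.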

Update 2:

This Subject line fails to decode:

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^

This is actually the sender's fault. The header violates the requirement that encoded-words be separated from surrounding text by white space, specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If need be, you can work around this by pre-processing such malformed headers with a regex that inserts a space after the encoded-word part (unless it is at the end of the value), like so:

import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
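
The pattern above uses a greedy `.*`, so when a value holds several encoded-words it may match from the first `=?` to the last `?=` and fix only the final one. A more conservative variant is sketched here, on the assumption that neither charset nor payload contains `?` or whitespace (as RFC 2047 requires); it matches one encoded-word at a time. The name separate_encoded_words is hypothetical.

```python
import re

# One encoded-word: =?charset?B-or-Q?payload?=
# The lookahead (?=\S) succeeds only when a non-whitespace character follows,
# so nothing is inserted at the end of the value or before existing whitespace.
_ENCODED_WORD = re.compile(r"(=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=)(?=\S)")

def separate_encoded_words(header_value):
    """Hypothetical helper: add the missing space after each encoded-word."""
    return _ENCODED_WORD.sub(r"\1 ", header_value)

broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
print(separate_encoded_words(broken))
```

Like the original pattern, this only adds a space after an encoded-word. Depending on how the decoder treats whitespace, the inserted space may remain visible in the decoded text.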
