Python - email header decoding UTF-8

Views: 275 | Published: 2017/8/8 19:09:12 | Tags: python, email, email-headers


Problem description

Is there a Python module that helps decode the various forms of encoded mail headers, mainly Subject, into plain (say, UTF-8) strings?

Here are example Subject headers from mail files that I have:

Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
Subject: [ 201105191633 ]
  =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=
  =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=

text - encoded string - text

text - encoded string

text - encoded string - encoded string

The encoding could also be something else, such as ISO 8859-15.

Update 1: I forgot to mention, I tried email.header.decode_header

    import email.header
    import logging

    # message is an already parsed email message object
    for item in message.items():
        if item[0] == 'Subject':
            sub = email.header.decode_header(item[1])
            logging.debug('Subject is %s' % sub)

This outputs

DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

which does not really help.

Update 2: Thanks to Ingmar Hupp in the comments.

The first example decodes to a list of two tuples:

print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
[('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]

Is the result always of the form [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or is there another way to get it all in one string?

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011

does not decode well:

print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")

[('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

Solution

This type of encoding is known as MIME encoded-word and the email module can decode it:

from email.header import decode_header
print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")

This outputs a list of tuples, each containing a decoded string and the encoding used. The list form exists because the format allows different encodings within a single header. To merge the parts into a single string, convert them to a shared encoding and then concatenate them, which in Python 2 can be done with the unicode object:

from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
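The snippet above is Python 2 (print statement, unicode()). As a rough sketch of the same idea for Python 3, assuming that decode_header returns bytes for decoded parts and a plain str when the header contains no encoded-word at all, something like the following could be used. The helper name decode_mime_header and its default_charset parameter are made up for this illustration and are not part of the email module.

```python
from email.header import decode_header


def decode_mime_header(value, default_charset="ascii"):
    """Return a header value as one str, decoding any MIME encoded-words."""
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Decoded parts arrive as bytes together with their charset
            # (None for unencoded runs, hence the fallback charset).
            parts.append(text.decode(charset or default_charset, errors="replace"))
        else:
            # A header without any encoded-word comes back as a plain str.
            parts.append(text)
    return "".join(parts)


subject = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
print(decode_mime_header(subject))
```

Using errors="replace" is a design choice for this sketch: a header with a wrong charset label then yields replacement characters instead of raising an exception. The email.header module also provides make_header, which can turn the decode_header result back into a Header object, though the spacing it inserts between parts may differ from a plain join.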

Update 2:

The problem with this Subject line not decoding:

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^

This is actually the sender's fault. The header violates the requirement that encoded-words in a header be separated from the surrounding text by whitespace. RFC 2047, section 5, paragraph 1 states: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If need be, you can work around this by pre-processing these malformed headers with a regex that inserts a space after the encoded-word part (unless it is at the end of the value), like so:

import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
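The pattern above uses a greedy .* and only adds a space after the match, so a header with several encoded-words, or with text glued to the front of one, may not be fully repaired. A slightly stricter variant is sketched below for Python 3. The names ENCODED_WORD and pad_encoded_words are hypothetical, and the pattern assumes the payload contains no "?" or whitespace, which is what RFC 2047 requires of encoded-text.

```python
import re

# One RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
# Excluding "?" and whitespace from the parts keeps each match to one word.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def pad_encoded_words(header_value):
    """Insert a space wherever other characters touch an encoded-word."""
    def repl(match):
        start, end = match.start(), match.end()
        # Pad on the left unless at the start or already preceded by whitespace.
        before = "" if start == 0 or header_value[start - 1].isspace() else " "
        # Pad on the right unless at the end or already followed by whitespace.
        after = "" if end == len(header_value) or header_value[end].isspace() else " "
        return before + match.group(0) + after

    return ENCODED_WORD.sub(repl, header_value)


raw = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
print(pad_encoded_words(raw))
```

The padded value can then be passed to decode_header as before. Newer Python 3 versions of decode_header may already tolerate the missing whitespace, so it is worth checking whether the workaround is needed at all. Note also that the inserted spaces remain in the decoded text.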
