How to get Email in UTF-8?


Question

I am writing a Python script to fetch the mail that people send to my email address.

I am using the IMAPClient module and I can get the content of the e-mail, but it comes out strangely: all my UTF-8 characters are encoded, like this:

No=C3=ABl

Here is my code:

    email_message = email.message_from_bytes(message_data[b'RFC822'])
    print(email_message.get_payload(0))

I also tried passing decode=True to get_payload, but then it returns None.

Recommended answer

First identify the part of the email you are interested in, then decode that part's content according to that part's own encoding. Each part may have a different transfer encoding and/or character set. If you are interested in the main body of the email, it is usually the first part, which may be HTML or plain text depending on the program that sent it (some user agents, such as Gmail, include both forms).
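As a minimal sketch of why the part has to be picked first: on a multipart message, get_payload(decode=True) is documented to return None, while the same call on a leaf part returns that part's bytes with the transfer encoding undone. The message bytes and the "XYZ" boundary below are made up for illustration.

```python
import email

# Hypothetical two-part message, similar in shape to what an IMAP fetch returns.
raw = (
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b'\r\n'
    b'--XYZ\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'No=C3=ABl\r\n'
    b'--XYZ\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<p>No=C3=ABl</p>\r\n'
    b'--XYZ--\r\n'
)

msg = email.message_from_bytes(raw)

# The multipart container has no payload of its own to decode.
container_payload = msg.get_payload(decode=True)

# The first sub-part is a leaf, so decode=True undoes quoted-printable
# and returns bytes (still encoded in the part's charset).
first_part = msg.get_payload(0)
part_bytes = first_part.get_payload(decode=True)

print(container_payload)
print(part_bytes)
```

The bytes printed for the part still need to be decoded with the part's charset to become text.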

You can call the email module's walk() method (EmailMessage.walk()) on your message object to see the various parts and attachments and their respective content types. The parts are separated from one another by a special "boundary" string (often random) that does not occur in the message body, to avoid ambiguity. It is easier to let the email module walk the parts for you, especially since parts can nest.
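The following sketch shows what walking a nested message might look like. The message, its "OUTER" and "INNER" boundaries, and the report.pdf attachment are all invented for illustration; only walk(), get_content_type(), get_content_charset() and get_filename() come from the standard email package.

```python
import email

# Hypothetical nested message: a multipart/alternative body plus one attachment.
raw = (
    b'Content-Type: multipart/mixed; boundary="OUTER"\r\n'
    b'\r\n'
    b'--OUTER\r\n'
    b'Content-Type: multipart/alternative; boundary="INNER"\r\n'
    b'\r\n'
    b'--INNER\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'\r\n'
    b'hello\r\n'
    b'--INNER\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'\r\n'
    b'<p>hello</p>\r\n'
    b'--INNER--\r\n'
    b'\r\n'
    b'--OUTER\r\n'
    b'Content-Type: application/pdf\r\n'
    b'Content-Disposition: attachment; filename="report.pdf"\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'\r\n'
    b'JVBERi0=\r\n'
    b'--OUTER--\r\n'
)

msg = email.message_from_bytes(raw)

# walk() visits the container first, then every nested part, depth-first.
summary = []
for part in msg.walk():
    summary.append(
        (part.get_content_type(), part.get_content_charset(), part.get_filename())
    )

for content_type, charset, filename in summary:
    print(content_type, charset, filename)
```

Containers show up in the walk as well, so a real script would typically skip any part for which is_multipart() is true.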

The snippet of text in your question appears to be quoted-printable encoded. An example of converting quoted-printable to UTF-8 can be found in the question titled: Change "Quoted-printable" encoding to "utf-8"
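For the snippet from the question, a minimal sketch with the standard quopri module could look like this. It assumes the bytes are UTF-8 once the quoted-printable layer is removed, which matches the =C3=AB sequence shown.

```python
import quopri

# The quoted-printable text from the question, as bytes.
encoded = b"No=C3=ABl"

# Step 1: undo quoted-printable; the result is still bytes.
raw_bytes = quopri.decodestring(encoded)

# Step 2: interpret those bytes as UTF-8 (assumed charset) to get a str.
text = raw_bytes.decode("utf-8")

print(text)
```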

Example:

Below is a mock raw message, representing the bytes from which the EmailMessage object is built. In an email, each section or part (main body, attachments, etc.) can have a different content-type, charset, and transfer-encoding. Parts can embed sub-parts, but email messages commonly have just a flat structure. Parts that are attachments usually also carry a content-disposition value, which suggests a filename for the file content.

Subject: Woah
From: "Sébastien" <seb@example.org>
To: Bob <bob@example.org>
Content-Type: multipart/alternative; boundary="000000000000690fec05765c6a66"

--000000000000690fec05765c6a66
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

S=C3=A9bastien est un pr=C3=A9nom.

--000000000000690fec05765c6a66
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div di=
r=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"lt=
r"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div=
dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr">...

...

Once you have selected the part you are interested in, use that part's encoding settings to convert the payload properly. First undo any transfer encoding (e.g. quoted-printable), then decode the resulting bytes according to the charset.
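The two steps can be sketched with the email package alone. The helper names part_text and first_text_plain are hypothetical, and falling back to UTF-8 when a part declares no charset is an assumption of this sketch, not something the email module does for you. The sample message is a shortened form of the mock message above with a made-up boundary.

```python
import email


def part_text(part):
    """Return the text of a non-multipart MIME part as a str."""
    # Step 1: undo the transfer encoding (quoted-printable, base64, ...).
    data = part.get_payload(decode=True)
    # Step 2: decode the bytes with the charset declared by the part.
    # Assumption: fall back to UTF-8 if the part declares none.
    charset = part.get_content_charset() or "utf-8"
    return data.decode(charset, errors="replace")


def first_text_plain(msg):
    """Return the decoded text of the first text/plain part, or None."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            return part_text(part)
    return None


# Shortened, hypothetical version of the mock message.
raw = (
    b'Subject: Woah\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'S=C3=A9bastien est un pr=C3=A9nom.\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<div dir=3D"ltr">S=C3=A9bastien</div>\r\n'
    b'--BOUNDARY--\r\n'
)

body = first_text_plain(email.message_from_bytes(raw))
print(body)
```

Using errors="replace" keeps the script running on a badly encoded message at the cost of substituting replacement characters; a stricter script might prefer to let the decode error surface.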

If the charset of the part you want is already UTF-8, all you have to do is undo the content-transfer-encoding (e.g. remove the quoted-printable sequences). However, if the part's charset is different, say Latin-1, you have to decode the bytes to Unicode text using that charset, and then encode that text back to bytes as UTF-8:

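import quopri

# hypothetical sample payload: "Noël" in Latin-1, quoted-printable encoded
mime_part_payload = b"No=EBl"

# remove quoted-printable encoding (bytes in, bytes out)
unquoted = quopri.decodestring(mime_part_payload)

# latin-1 in this case is the charset named in the MIME part's header
tmp_unicode = unquoted.decode('latin-1', errors='ignore')

# encode to the desired encoding
u8 = tmp_unicode.encode('utf-8')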
