Line breaks in IMAP - = - how to decode?

Problem description

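I'm trying to build an email scraper that goes through certain emails and looks for values to store in a CSV file. I've tried many ways to solve this, but none have worked so far.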

# Function to get email content part i.e its body part
def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(decode=True)).decode()
    else:
        return msg.get_payload(decode=True).decode()
 
# Function to search for a key value pair
def search(key, value, con):
    result, data = con.search(None, key, '"{}"'.format(value))
    return data
 
# Function to get the list of emails under this label
def get_emails(result_bytes):
    print("get email")
    msgs = [] # all the email data are pushed inside an array
    for num in result_bytes[0].split():
        typ, data = con.fetch(num, '(RFC822)')
        msgs.append(data)
    return msgs
 
# this is done to make SSL connection with GMAIL
con = imaplib.IMAP4_SSL(imap_url)
con.login(user, password)
con.select('Inbox')
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            print(msg)
            # encoding set as utf-8
            content = sent[1], 'utf-8'
            data = str(content)
 
            # Handling errors related to UnicodeEncodeError
            try:
                indexstart = data.find("span")
                data2 = data[indexstart + 5: len(data)]
                indexend = data2.find("</div>")
 
                # printing the required content which we need
                # to extract from our email, i.e. its body
                
                waarde = data2[0: indexend]
                test_naam_1 = waarde.split("Naam: ",1)[1]
                echte_naam = test_naam_1.split("Email: ",-1)[0]
            
                email_test = waarde.split("Email: ",1)[1]
                echte_email = email_test.split("Tel nr.: ",-1)[0]
                                    
                tel_test = waarde.split("Tel nr.: ",1)[1]
                echte_tel = tel_test.split("Onderwerp: ",-1)[0]
            
                subj_test = waarde.split("Onderwerp: ",1)[1]
                echte_subj = subj_test.split("Bericht: ",-1)[0]


                print("---ADRESGEGEVENS---")
                print("---Naam: " + echte_naam + "---")
                print("---Naam: " + echte_email + "---")
                print("---Naam: " + echte_tel + "---")
                print("---Naam: " + echte_subj + "---")

Now in my results I keep getting these ugly line breaks, which look like this in my markup:

[(b'12638 (RFC822 {1973}', b'MIME-Version: 1.0
Date: Mon, 25 Oct 2021 16:41:46 +0200
Message-ID: <CAJDn=xsVynQqp7BwYoGZB=v21-AAR5=xcMkQ8D2kXE7ZpYFNNQ@mail.example.com>
Subject: TESTTITELPYTHON
From: Patrick Merkx <patrick@example.nl>
To: Patrick Merkx <patrick@example.nl>
Content-Type: multipart/alternative; boundary="00000000000042e6ae05cf2e5c7e"

--00000000000042e6ae05cf2e5c7e
Content-Type: text/plain; charset="UTF-8"

Contactformulier ingevuld door:
Naam: Patrick Merkx
Email: merkx.patrick@example.com
Tel nr.: 0611381219

Onderwerp: Nog een test

Bericht:
Bericht

--00000000000042e6ae05cf2e5c7e
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div><div dir=3D"ltr" class=3D"gmail_signature" data-smart=
mail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><div><div d=
ir=3D"ltr"><div style=3D"font-stretch:normal;font-size:13.33px;line-height:=
19.99px;background:none;border:0px rgb(34,34,34);width:600px;overflow:visib=
le;min-height:0px;outline-width:0px"><span class=3D"gmail-il" style=3D"font=
-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=
ngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=
small">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=
=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=
il.com" target=3D"_blank" style=3D"font-size:small">merkx.patrick@example.com=
</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =
0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=
"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=
=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=
:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=
ze:small">Bericht</span><br></div></div></div></div></div></div></div></div=
></div>

--00000000000042e6ae05cf2e5c7e--'), b')']
class=3D"gmail-il" style=3D"font=
-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=
ngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=
small">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=
=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=
il.com" target=3D"_blank" style=3D"font-size:small">merkx.patrick@gmail.com=
</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =
0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=
"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=
=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=
:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=
ze:small">Bericht</span><br>

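I've also tried stripping the body tags, decoding, and several other solutions, but no luck so far. I can't seem to remove these line breaks with any method I know of.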

What am I doing wrong?

Recommended answer

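The MIME part you are looking at has Content-Transfer-Encoding: quoted-printable. The proper way to decode it is to traverse the MIME structure and interpret the parts as you go. But there is no need to do this explicitly; Python's email library already does it for you.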
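
To make the encoding itself concrete, here is a minimal sketch using the standard-library quopri module on a shortened fragment in the style of the HTML part from the question (the fragment is abbreviated for illustration, not copied in full). In quoted-printable, "=3D" stands for a literal "=", "=C2=A0" for the two UTF-8 bytes of a non-breaking space, and a "=" at the end of a line is a soft line break that disappears on decoding.

```python
import quopri

# Shortened, illustrative fragment in the style of the question's HTML part.
encoded = (
    b'<span style=3D"font-size:small">Email:=C2=A0</span><span style=3D"font-si=\n'
    b'ze:small">Bericht</span>'
)

# Undo the quoted-printable transfer encoding; the result is still bytes.
decoded_bytes = quopri.decodestring(encoded)

# Only now is it safe to apply the charset declared for this part (UTF-8).
decoded_text = decoded_bytes.decode("utf-8")
print(decoded_text)
```

This is only meant to show what the "=" sequences mean. With the email package, the same decoding happens automatically when a part is read through its API.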

from email import message_from_bytes
from email.policy import default

...
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for fetched in msg_ids[::-1]:
    for sent in fetched:
        if isinstance(sent, tuple):
            # sent[1] holds the raw RFC822 bytes of the message
            msg = message_from_bytes(sent[1], policy=default)
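
Unfortunately, without a sample of the MIME structure in these messages, I can't tell you exactly what to do with the resulting message. Most likely you want something like msg.get_body(preferencelist=('html', 'plain')), which picks out the MIME body part, followed by get_content() on the result, which extracts the actual decoded body text.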
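
As a rough sketch of how that could look, the following parses a small hand-written message modelled on the structure shown in the question. The boundary string and the shortened bodies are made up for illustration; in the real script the bytes would come from sent[1].

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical, shortened stand-in for the fetched RFC822 bytes.
raw = (
    b'MIME-Version: 1.0\r\n'
    b'Subject: TESTTITELPYTHON\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'\r\n'
    b'Naam: Patrick Merkx\r\n'
    b'Tel nr.: 0611381219\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<div dir=3D"ltr"><span style=3D"font-size:small">Naam: Patrick Me=\r\n'
    b'rkx</span></div>\r\n'
    b'--BOUNDARY--\r\n'
)

msg = message_from_bytes(raw, policy=default)

# Prefer the HTML alternative, fall back to plain text if there is none.
html_part = msg.get_body(preferencelist=('html', 'plain'))
html = html_part.get_content()   # transfer encoding and charset already undone

# Asking for the plain-text alternative avoids HTML parsing altogether.
plain_part = msg.get_body(preferencelist=('plain',))
plain = plain_part.get_content()

print(html)
print(plain)
```

Since the messages in the question also carry a text/plain alternative with the same fields, reading that part may be simpler than cutting values out of the HTML.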

The policy=default keyword argument selects the modern email.message.EmailMessage object class introduced in Python 3.6, instead of the legacy email.message.Message object used by older versions.
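
A small sketch of the difference, using a made-up two-line message: without a policy argument the parser returns the legacy class, which lacks get_body() and get_content().

```python
from email import message_from_bytes
from email.message import EmailMessage, Message
from email.policy import default

# Hypothetical minimal message, just enough to parse.
raw = b'Subject: TESTTITELPYTHON\r\n\r\nBericht\r\n'

legacy = message_from_bytes(raw)                  # no policy given: legacy API
modern = message_from_bytes(raw, policy=default)  # modern EmailMessage API

print(type(legacy).__name__, hasattr(legacy, 'get_body'))
print(type(modern).__name__, hasattr(modern, 'get_body'))
```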

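In more detail, trying to decode the raw email body as UTF-8 is very wrong. A typical MIME message has several parts, each of which may have a different encoding, and many of them definitely do not use UTF-8 (although it is becoming more popular; even then, the actual UTF-8 will usually sit behind a content transfer encoding that protects it from damage in transit over routes that may not be 8-bit clean).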
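
To illustrate the point, here is a sketch with an invented two-part message in which each part declares its own charset and content transfer encoding. The message contents and the boundary are assumptions made up for this example. Walking the parts and calling get_content() lets the library apply the right decoding per part.

```python
from email import message_from_bytes
from email.policy import default

# Invented message: one Latin-1 part in quoted-printable, one UTF-8 part in base64.
raw = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'Caf=E9\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'\r\n'
    b'PHA+Q2Fmw6k8L3A+\r\n'
    b'--BOUNDARY--\r\n'
)

msg = message_from_bytes(raw, policy=default)

summary = []
for part in msg.walk():
    if part.is_multipart():
        continue  # the container itself has no text of its own
    summary.append((
        part.get_content_type(),
        part.get_content_charset(),
        str(part['Content-Transfer-Encoding']),
        part.get_content(),  # decoded with this part's own encoding and charset
    ))

for row in summary:
    print(row)
```

Decoding the whole raw message as UTF-8 would handle neither part correctly, whereas the per-part approach yields the same readable text from both.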
