Wrong encoding of email attachment

Problem description

I have a Python 2.7 script running on Windows. It logs in to Gmail and checks for new e-mails and attachments:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

file_types = ["pdf", "doc", "docx"] # download attachments with these extensions

login = "login"
passw = "password"

imap_server = "imap.gmail.com"
smtp_server = "smtp.gmail.com"
smtp_port = 587

from smtplib import SMTP
from email.parser import HeaderParser
from email.MIMEText import MIMEText
import sys
import imaplib
import getpass
import email
import datetime
import os
import time

if __name__ == "__main__":
    try:
        while True:
            session = imaplib.IMAP4_SSL(imap_server)
            try:
                rv, data = session.login(login, passw)
                print "Logged in: ", rv
            except imaplib.IMAP4.error:
                print "Login failed!"
                sys.exit(1)

            rv, mailboxes = session.list()
            rv, data = session.select(foldr)
            rv, data = session.search(None, "(UNSEEN)")
            for num in data[ 0 ].split():
                rv, data = session.fetch(num, "(RFC822)")
                for rpart in data:
                    if isinstance(rpart, tuple):
                        msg = email.message_from_string(rpart[ 1 ])
                        to = email.utils.parseaddr(msg[ "From" ])[ 1 ]
                text = data[ 0 ][ 1 ]
                msg = email.message_from_string(text)
                got = []
                for part in msg.walk():
                    if part.get_content_maintype() == 'multipart':
                        continue
                    if part.get('Content-Disposition') is None:
                        continue
                    filename = part.get_filename()
                    print "file: ", filename
                    print "Extention: ", filename.split(".")[ -1 ]
                    if filename.split(".")[ -1 ] not in file_types:
                        continue
                    data = part.get_payload(decode = True)
                    if not data:
                        continue
                    date = datetime.datetime.now().strftime("%Y-%m-%d")
                    if not os.path.isdir("CONTENT"):
                        os.mkdir("CONTENT")
                    if not os.path.isdir("CONTENT/" + date):
                        os.mkdir("CONTENT/" + date)
                    ftime = datetime.datetime.now().strftime("%H-%M-%S")
                    new_file = "CONTENT/" + date + "/" + ftime + "_" + filename
                    f = open(new_file, 'wb')
                    print "Got new file %s from %s" % (new_file, to)
                    got.append(filename.encode("utf-8"))
                    f.write(data)
                    f.close()
            session.close()
            session.logout()
            time.sleep(60)
    except:
        print "TARFUN!"

The problem is that the last print outputs garbage, for example:
=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?=
so the later checks don't work. On Linux it works just fine. So far I have tried decoding and encoding the filename as UTF-8, but it made no difference. Thanks in advance.

Solution

If you read the spec that defines the filename field, RFC 2183, section 2.3, it says:

Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII. We recognize the great desirability of allowing arbitrary character sets in filenames, but it is beyond the scope of this document to define the necessary mechanisms. We expect that the basic [RFC 1521] 'value' specification will someday be amended to allow use of non-US-ASCII characters, at which time the same mechanism should be used in the Content-Disposition filename parameter.

There are proposed RFCs to handle this. In particular, it's been suggested that filenames be handled as encoded-words, as defined by RFC 5987, RFC 2047, and RFC 2231. In brief, this means either RFC 2047 format:

"=?" charset "?" encoding "?" encoded-text "?="

… or RFC 2231 format:

"=?" charset ["*" language] "?" encoding "?" encoded-text "?="
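Both shapes can be recognized with a single pattern. The following is a minimal sketch rather than a full RFC-compliant parser: the name ENCODED_WORD and the character classes are assumptions chosen for illustration, and it assumes that encoded-words seen in practice carry the encoding field in both forms, as in RFC 2231's own example =?US-ASCII*EN?Q?Keith_Moore?=.

```python
import re

# Hypothetical pattern for illustration: charset, an optional "*language"
# suffix (the RFC 2231 extension), a B or Q encoding marker, then the text.
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*\s]+)"
    r"(?:\*(?P<language>[^?\s]+))?"
    r"\?(?P<encoding>[BbQq])"
    r"\?(?P<text>[^?\s]*)\?=$"
)

# The filename from the question (RFC 2047 shape, no language tag).
plain = ENCODED_WORD.match(
    "=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="
)
# An encoded-word with a language tag (RFC 2231 shape).
tagged = ENCODED_WORD.match("=?US-ASCII*EN?Q?Keith_Moore?=")

print(plain.groupdict())
print(tagged.groupdict())
```

The language group is None when no tag is present, so the same code path can serve both formats.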

Some mail agents already use this, and others don't know what to do with it. Message.get_filename() in the Python 2.x email package is among the latter: it hands back an RFC 2047 encoded-word untouched rather than decoding it. (It's possible that the later version in Python 3.x behaves differently, or that this may change in the future, but that won't help you if you want to stick with 2.x.) So, if you want the real filename, you have to decode it explicitly.
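To see the behaviour in isolation, here is a small sketch that parses a hand-written message part and asks for its filename. The message text is made up for illustration (only the Content-Disposition header matters), and it assumes the default parsing used by email.message_from_string.

```python
import email

# A tiny hand-written message part, invented for this example.
raw_message = (
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="\n'
    "\n"
    "payload\n"
)

part = email.message_from_string(raw_message)
filename = part.get_filename()

# The encoded-word comes back as-is; nothing has decoded it yet.
print(repr(filename))
```

This is why the extension check in the question fails: splitting the returned string on "." never yields "pdf", "doc" or "docx".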

In your example, you've got a filename in RFC 2047 format, with charset UTF-8 (which is usable directly as a Python encoding name), encoding B, which means Base-64, and content 0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv. So, you have to base-64 decode that, then UTF-8-decode that, and you get u'часть 1 текст методички.do'.
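Those two steps can be sketched by hand as follows. This is only an illustration for this one value: it assumes a single, well-formed encoded-word with B encoding and does no error handling.

```python
# -*- coding: utf-8 -*-
import base64

encoded_word = "=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="

# Strip the leading "=?" and trailing "?=", then split into the three fields.
charset, encoding, encoded_text = encoded_word[2:-2].split("?")

# "B" means Base-64: first recover the raw bytes ...
raw_bytes = base64.b64decode(encoded_text)

# ... then interpret those bytes in the declared charset.
filename = raw_bytes.decode(charset)

print(repr(filename))
```

The truncated ".do" extension is already present in the encoded value, so it is not a decoding error.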

If you want to do this more generally, you're going to have to write code that tries to interpret each filename in RFC 2231 format if possible, and in RFC 2047 format otherwise, and then applies the appropriate decoding steps. A complete version is too involved to fit in a Stack Overflow answer, but the basic idea is pretty simple, as demonstrated above, so you should be able to write it yourself. You may also want to search PyPI for existing implementations.
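For the RFC 2047 half of that job, the standard library's email.header.decode_header can do the splitting and Base-64 or quoted-printable decoding. The sketch below is one possible starting point, not a complete solution: decode_filename and fallback_charset are hypothetical names, the "replace" error policy is an arbitrary choice, and it does not handle the RFC 2231 parameter syntax.

```python
# -*- coding: utf-8 -*-
from email.header import decode_header


def decode_filename(raw_filename, fallback_charset="ascii"):
    """Return raw_filename with any RFC 2047 encoded-words decoded to text.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for chunk, charset in decode_header(raw_filename):
        # Chunks may arrive as bytes (decoded payloads) or as text
        # (plain input with no encoded-words); normalize to text.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or fallback_charset, "replace")
        pieces.append(chunk)
    return u"".join(pieces)


decoded = decode_filename(
    "=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="
)
untouched = decode_filename("report.pdf")

print(repr(decoded))
print(repr(untouched))
```

In the script from the question, the result of part.get_filename() could be passed through such a helper before the extension check and before building new_file.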
