Python email quoted-printable encoding problem
I am extracting emails from Gmail using the following:
import email
import imaplib
import sys

# username, password and subject are assumed to be defined elsewhere.

def getMsgs():
    try:
        conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
    except:
        print 'Failed to connect'
        print 'Is your internet connection working?'
        sys.exit()
    try:
        conn.login(username, password)
    except:
        print 'Failed to login'
        print 'Is the username and password correct?'
        sys.exit()
    conn.select('Inbox')
    # typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
    typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
    for num in data[0].split():
        typ, data = conn.fetch(num, '(RFC822)')
        msg = email.message_from_string(data[0][1])
        yield walkMsg(msg)

def walkMsg(msg):
    # Return the raw (still transfer-encoded) payload of the first text/plain part.
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        return part.get_payload()
However, for some of the emails I get, it is nearly impossible to extract dates (using regex), because encoding-related characters such as '=' land seemingly at random in the middle of various text fields. Here's an example where it occurs in a date range I want to extract:
Name: KIRSTI Email: kirsti@blah.blah Phone #: + 999 99995192 Total in party: 4 total, 0 children Arrival/Departure: Oct 9= , 2010 - Oct 13, 2010 - Oct 13, 2010
Is there a way to remove these encoding characters?
You could/should use the email.parser module to decode mail messages, for example (quick and dirty example!):
from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()
# Now you can access the message and its submessages (if it's multipart)
print rootMessage.is_multipart()
# Or check for errors
print rootMessage.defects
# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)
With decode=True, Message.get_payload decodes the content according to the part's Content-Transfer-Encoding header (e.g. quoted-printable, as in your question), which removes the stray '=' soft line breaks.
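To tie this back to the walkMsg function in the question, here is a minimal sketch in Python 3 syntax (the original snippets use Python 2 print statements). The sample message, its addresses, the helper name first_plain_text and the UTF-8 fallback charset are all assumptions made for illustration; only the get_payload(decode=True) call comes from the answer above.

```python
from email import message_from_string

# Hypothetical sample message. The body mimics the soft line break
# ("=" at the end of a line) seen in the question.
RAW_MESSAGE = (
    "From: sender@example.com\n"
    "Subject: Booking\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Arrival/Departure: Oct 9=\n"
    ", 2010 - Oct 13, 2010\n"
)


def first_plain_text(msg):
    """Return the decoded text of the first text/plain part, or None."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes the Content-Transfer-Encoding and returns bytes.
        payload = part.get_payload(decode=True)
        # Falling back to UTF-8 is an assumption for parts without a charset.
        charset = part.get_content_charset() or "utf-8"
        return payload.decode(charset, errors="replace")
    return None


msg = message_from_string(RAW_MESSAGE)
print(repr(msg.get_payload()))       # raw body, soft line break still present
print(repr(first_plain_text(msg)))   # decoded body, soft line break removed
```

In Python 3, imaplib returns the fetched message as bytes, so a live version of getMsgs would likely call email.message_from_bytes(data[0][1]) instead of email.message_from_string.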