Finding links in an email's body with Python


Question

I am currently working on a Python project that connects to an email server and looks at the latest email to tell the user whether it contains an attachment or an embedded link. I have the former working but not the latter.

I may be having trouble with the if any() part of my script, since it only seems to half work when I test it. It may also be due to how the email string is printed out.

Here is my code for connecting to Gmail and then looking for the link.

import imaplib
import email

word = ["http://", "https://", "www.", ".com", ".co.uk"] #list of strings to search for in email body

#connection to the email server
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('email@gmail.com', 'password')
mail.list()
# Out: list of "folders" aka labels in gmail.
mail.select("Inbox", readonly=True) # connect to inbox.

result, data = mail.uid('search', None, "ALL") # search and return uids instead

ids = data[0] # data is a list.
id_list = ids.split() # ids is a space separated string
latest_email_uid = data[0].split()[-1]

result, data = mail.uid('fetch', latest_email_uid, '(RFC822)') # fetch the email headers and body (RFC822) for the given ID


raw_email = data[0][1] # here's the body, which is raw headers and html and body of the whole email
# including headers and alternate payloads

print "---------------------------------------------------------"
print "Are there links in the email?"
print "---------------------------------------------------------"

msg = email.message_from_string(raw_email)
for part in msg.walk():
    # each part is a either non-multipart, or another multipart message
    # that contains further parts... Message is organized like a tree
    if part.get_content_type() == 'text/plain':
        plain_text = part.get_payload()
        print plain_text # prints the raw text
        if any(word in plain_text for word in word):
            print '****'
            print 'found link in email body'
            print '****'
        else:
            print '****'
            print 'no link in email body'
            print '****'

So, as you can see, I have a variable called word that contains a list of keywords to search for in the plain-text email.

When I send a test email with an embedded link in the 'http://' or 'https://' format, the script prints the email body with the link in the text, like this:

---------------------------------------------------------
Are there links in the email?
---------------------------------------------------------
Test Link <http://www.google.com/>


****
found link in email body
****

And I get my print message saying 'found link in email body', which is the result I am looking for in my test phase; in the final program this will trigger something else.

Yet if I add an embedded link to the email without http://, such as google.com, the link doesn't print out and I don't get the result, even though the email has an embedded link.

Is there a reason for this? I also suspect my if any() check is not really the best approach. I didn't really understand it when I originally added it, but it worked for http:// links. Then I tried just a .com link and ran into this problem, which I am having trouble solving.

Recommended answer

To check whether an e-mail has attachments, you can look at the Content-Type header and see whether it says "multipart/*". E-mails with a multipart content type may contain attachments, although a multipart type alone does not prove it: a message with plain-text and HTML alternatives is also multipart.
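As a rough sketch of the attachment check described above, the snippet below walks every part of a message and looks at its Content-Disposition rather than relying on the multipart content type alone. It assumes Python 3 (the question's code is Python 2), and the helper name has_attachment and the two test messages are made up for illustration; they are built locally instead of being fetched from a server.

```python
from email.message import EmailMessage


def has_attachment(msg):
    """Return True if any part of the message is marked as an attachment."""
    for part in msg.walk():
        # get_content_disposition() returns 'attachment', 'inline', or None.
        if part.get_content_disposition() == "attachment":
            return True
    return False


# Hypothetical test messages, built locally for the example.
plain = EmailMessage()
plain["Subject"] = "No attachment"
plain.set_content("Just text.")

with_file = EmailMessage()
with_file["Subject"] = "With attachment"
with_file.set_content("See the attached file.")
with_file.add_attachment(
    b"hello",
    maintype="application",
    subtype="octet-stream",
    filename="hello.bin",
)

print(with_file.get_content_type())  # the top-level type becomes multipart
print(has_attachment(plain))
print(has_attachment(with_file))
```

Checking the disposition of each part avoids reporting an attachment for messages that are multipart only because they carry both a plain-text and an HTML version of the body.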

To inspect the text for links, images, and so on, you can try regular expressions (regex); in my opinion this is probably your best option. A regex finds strings that match a given pattern. The pattern "<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>", for example, should match the links in your email message regardless of whether they are a single word or a full URL. I hope that helps! Here's an example of how you can implement this in Python:

import re

# Example HTML body; the href uses single quotes to match the pattern below.
text = ("This is your e-mail body. It contains a link to "
        "<a href='http://www.google.com'>Google</a>.")

# Group 1 captures the href value, group 2 the visible link text.
link_pattern = re.compile(r"<a[^>]+href='(.*?)'[^>]*>(.*?)</a>")
search = link_pattern.search(text)
if search is not None:
    print("Link found! -> " + search.group(0))
else:
    print("No links were found.")
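The example above stops at the first match and only handles one quote style. A possible variation, shown here as a sketch rather than a complete HTML parser, uses finditer to collect every anchor and a backreference so that single-quoted and double-quoted href values both match. The sample body and the name LINK_PATTERN are assumptions made for the illustration.

```python
import re

# Hypothetical sample body that mixes single and double quotes.
html_body = (
    "<p>Visit <a href='http://www.google.com'>Google</a> or "
    '<a class="x" href="example.org/page">this page</a>.</p>'
)

# Group 1: the quote character, group 2: the href value, group 3: the link text.
# \1 requires the closing quote to be the same character as the opening one.
LINK_PATTERN = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

links = [(m.group(2), m.group(3)) for m in LINK_PATTERN.finditer(html_body)]
for href, label in links:
    print(label, "->", href)
```

Because the pattern keys on the href attribute rather than on "http://" or ".com", a link such as example.org/page is found even though it has no scheme.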

To the end user the link appears simply as "Google", with no www and certainly no http(s). The source of the message, however, wraps it in HTML, so by inspecting the raw body of the message you can find all links.
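To make the idea of inspecting the raw body more concrete, here is a minimal sketch that walks the message parts, reads href values from any text/html part, and falls back to a simple URL pattern for text/plain parts. It assumes Python 3, where imaplib returns the fetched message as bytes, so email.message_from_bytes is used instead of message_from_string. The function name find_links, both patterns, and the sample message are illustrative assumptions, not part of the original code.

```python
import email
import re
from email import policy
from email.message import EmailMessage

# href="..." or href='...' inside HTML parts.
HREF_PATTERN = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)
# Bare URLs in plain text: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s<>"']+""", re.IGNORECASE)


def find_links(raw_bytes):
    """Return the links found in the text parts of a raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    links = []
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        # get_content() undoes the transfer encoding and decodes the charset.
        text = part.get_content()
        if ctype == "text/html":
            links += [m.group(2) for m in HREF_PATTERN.finditer(text)]
        else:
            links += URL_PATTERN.findall(text)
    return links


# Hypothetical message with a plain-text part and an HTML alternative.
sample = EmailMessage()
sample["Subject"] = "Test"
sample.set_content("Test Link <http://www.google.com/>\n")
sample.add_alternative('<p><a href="google.com">Test Link</a></p>', subtype="html")

found = find_links(bytes(sample))
print(found)
```

In the question's script, the bytes passed to find_links would be the raw_email value returned by the fetch call. Looking at the text/html part is what lets a scheme-less link like google.com show up, since it may never appear as a recognizable URL in the text/plain part.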

My code is not perfect, but I hope it gives you a general direction. You can look up multiple patterns in your e-mail body text: image occurrences, videos, and so on. Learning regular expressions takes a little research; the Wikipedia article on regular expressions is one place to start.
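One way to look up several patterns at once is to keep them in a dictionary and count the matches for each. The sketch below is only an illustration: the pattern set, the name count_matches, and the sample HTML are assumptions, and simple tag patterns like these will not cover every way a link, image, or video can be embedded.

```python
import re

# Hypothetical pattern set; each pattern matches the start of one kind of tag.
PATTERNS = {
    "link": re.compile(r"<a\b[^>]*\bhref\s*=", re.IGNORECASE),
    "image": re.compile(r"<img\b[^>]*\bsrc\s*=", re.IGNORECASE),
    "video": re.compile(r"<video\b", re.IGNORECASE),
}


def count_matches(html_body):
    """Count how many times each pattern occurs in the HTML body."""
    return {name: len(p.findall(html_body)) for name, p in PATTERNS.items()}


sample_html = (
    '<p><a href="http://example.com">x</a>'
    '<img src="a.png"><img alt="b" src="b.png"></p>'
)
counts = count_matches(sample_html)
print(counts)
```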
