Scraping Emails from a List of URLs Saved in CSV - BeautifulSoup


Problem Description


I am trying to parse through a list of URLs saved in a CSV file in order to scrape email addresses. However, the code below only manages to fetch email addresses from a single website. I need advice on how to modify the code to loop through the list and save the outcome (the list of emails) to a CSV file.

import requests
import re
import csv
from bs4 import BeautifulSoup

allLinks = [];mails=[]
with open(r'url.csv', newline='') as csvfile:
    urls = csv.reader(csvfile, delimiter=' ', quotechar='|')
    links = []
    for url in urls:
        response = requests.get(url)
        soup=BeautifulSoup(response.text,'html.parser')
        links = [a.attrs.get('href') for a in soup.select('a[href]') ]

allLinks=set(links)

def findMails(soup):
    for name in soup.find_all('a'):
        if(name is not None):
            emailText=name.text
            match=bool(re.match('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                    print(emailText)
                mails.append(emailText)
for link in allLinks:
    if(link.startswith("http") or link.startswith("www")):
        r=requests.get(link)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

    else:
        newurl=url+link
        r=requests.get(newurl)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")

Answer

To accumulate the links from every URL, append to links with += instead of reassigning it on each loop iteration (the original code overwrote links each time, so only the last site's links survived):

allLinks = [];mails=[]
urls = ['https://www.nus.edu.sg/', 'http://gwiconsulting.com/']
links = []

for url in urls:
    response = requests.get(url)
    soup=BeautifulSoup(response.text,'html.parser')
    links += [a.attrs.get('href') for a in soup.select('a[href]') ]

allLinks=set(links)

At the end, loop over your mails and write them to a CSV file:

import csv

with open("emails.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Email'])
    for mail in mails:
        w.writerow([mail])  # wrap in a list: writerow(mail) would split the string into one character per column
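A separate weak point in the question's code is the relative-link branch, where newurl = url + link concatenates naively and produces broken URLs for absolute paths such as /contact. The standard library's urllib.parse.urljoin resolves both relative and absolute paths correctly; a small sketch (the page paths here are made-up examples, not real pages on that site):

```python
from urllib.parse import urljoin

base = 'http://gwiconsulting.com/'
# An absolute path replaces everything after the host.
print(urljoin(base, '/contact.html'))    # http://gwiconsulting.com/contact.html
# A relative path is resolved against the base URL.
print(urljoin(base, 'about/team.html'))  # http://gwiconsulting.com/about/team.html
```

In the scraper loop this would become r = requests.get(urljoin(url, link)) instead of string concatenation.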

