来自网址列表的电子邮件抓取工具 [英] Email scraper from a list of urls

查看:125
本文介绍了来自网址列表的电子邮件抓取工具的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试构建什么

在 Sheet1 列 A row2 下到 x 行,将有一个网站 URL 列表.我需要代码来浏览 url 并找到电话号码和电子邮件并将它们放在 url 旁边的 B + C 列中,如果没有找到,请在单元格中放置一个连字符.

我几乎已经开始工作了.代码循环遍历 Sheet1 列 A 中的 URL 列表并提取电话号码和电子邮件,将它们放入列 B 和 C.我编写的当前代码只有 3 个问题,这些问题如下所述问题 3 可能很简单修复.

新代码

私有子命令Button1_Click()' 运行主代码Dim wb 作为工作簿将 wsSheet 作为工作表变暗,将链接作为变体,IE 作为对象,将链接作为变体Dim rw As Long将 html 调暗为新的 HTMLDocumentDim regxp As New RegExp, post As Object, phone_list As Object, email_list As Object'SHEET1 作为带有 URL 的工作表设置 wb = ThisWorkbook设置 wsSheet = wb.Sheets("Sheet1")'设置IE = InternetExplorerSet IE = CreateObject("InternetExplorer.Application")rw = wsSheet.Cells(wsSheet.Rows.Count, A").End(xlUp).Row链接 = wsSheet.Range("A2:A" & rw)'IE 打开时间每页 4 秒并检查 Sheet2 A 列上的链接用 IE.可见 = 真Application.Wait(现在 + 时间值(00:00:04"))对于每个链接在链接中.navigate(链接)而 .Busy 或 .readyState <>4:DoEvents:温德设置 html = .document'Application.Wait (Now + TimeValue("00:00:04"))使用 regxp.Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})"设置 phone_list = .Execute(html.body.innerHTML).Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+";设置 email_list = .Execute(html.body.innerHTML)结束于Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, B").End(xlUp).Row + 1, B").Value = phone_list(0)Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, C").End(xlUp).Row + 1, C").Value = email_list(0)''''############################## 我试图解决这个问题#########################'''' ############################# 放置一个连字符如果没有找到#########################'''' 如果 regxp 什么都没有,那么'''' wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list()''''        别的'''' wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)''''        万一'''''''' 如果 regxp 什么都没有,那么'''' wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, C").End(xlUp).Row + 1, C").Value = email_list()''''        别的'''' wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)''''        万一''''############################## 我试图解决这个问题#########################''''##########################################################################################'导航链接下一个链接'关闭IE浏览器.放弃结束于设置 IE = 无结束子

问题 1

如果没有要提取的项目,则代码不会转到下一个 url,出于某种原因,它只是停留在该页面上,否则我会收到一条错误消息.例如 网站有电话号码但没有电子邮件,页面不会导航到下一个网址.我尝试使用 IF 语句解决此问题,但无法使其正常工作..应该怎么做 如果没有什么可提取的,请转到 A 列中的下一个网址

问题 2

如果网站的安全证书无效或网址已失效,则代码不会导航到下一个网址,而是等待用户输入.如果我点击否"声明我不想导航到此站点代码崩溃.如果证书无效或 url 已死,那么它应该移动到下一个 url,所以如果站点在 X 时间内没有加载,则移动到下一个 url. 不确定这是否也可以使用对于问题1

我想我需要这样的东西,但不能用我的代码解决

感谢您的浏览 请有人帮我解决这三个问题中的任何一个.提前致谢.

英国时间今天 24/7/2020 12:56 更新

我添加了一个更好的 Regxp 来查找电话号码,自从发布问题 3 以来,它有所改进 .Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})"它仍然在剪裁数字的一些数字,见下图,绿色数字找到并且最后一位数字丢失

发表于 Mr Excel

#############今天更新,英国时间 2020 年 7 月 31 日星期五下午 1 点 26 分

我已经解决了数据NOT进入正确位置的问题,通过使用 IF 语句代码.所以现在问题 1 和 2 看起来没问题.只剩下问题 3,我认为这是一个简单的修复 LOL.

问题是这样的

如果 regxp 什么都没有,那么

应该是

如果 Phone_List (0) 是什么,那么

如果 Email_List (0) 是什么,那么

########### 今日更新,英国时间 8 月 3 日星期一 11:45 #############

这是我解决电话号码不正确的问题 3 的解决方法.

我已经更改了代码的模式部分,所以现在它从工作表中提取 REGXP 模式,Sheet1.Range D1.通过这种方式,我可以更改单元格中的 regxp 模式以获取不同的电话号码类型.

''' ########## 电话号码模式###########.Pattern = ThisWorkbook.Sheets(Sheet1").Range(D1").Global = 假.IgnoreCase = 真设置 phone_list = .Execute(html.body.innerHtml)

这是我目前在英国使用的 Regxp 模式.它被放置在 Sheet1 CELL D1 中

(?:\+1)?(?:\+[0-9])?\(?([0-9]{4})\)?[-.]?([0-9]{4})[-.]?([0-9]{3}?)

如果有人有更好的模式请留言.

########## 英国时间 2020 年 8 月 5 日星期二 1:35 今天更新##########

我有 MSXML2.ServerXMLHTTP 代码,它运行得更快,但遗漏了一些电子邮件和号码.我编写的 IE 版本和 SMTH 编写的代码 ANSWER 获取额外的电子邮件和电话号码.我将 SMTH ANSWER 中的 regxp 模式更改为我的,以获得更好的结果.

如果有人知道为什么那么请提供建议,否则 SMTH 代码就是答案,因为它与我的工作相同,但写得更好.

Private Sub CommandButton2_Click()'''######### 没有 IE 这个代码更快######Dim wb 作为工作簿将 wsSheet 变暗为工作表、链接为变体、链接为变体Dim rw As LongDim regxp As New RegExp, post As Object, phone_list As Object, email_list As ObjectDim Html 作为新的 HTMLDocument''''SHEET1 作为带有 URL 的工作表设置 wb = ThisWorkbook设置 wsSheet = wb.Sheets("Sheet1")rw = wsSheet.Cells(wsSheet.Rows.Count, A").End(xlUp).Row链接 = wsSheet.Range("A2:A" & rw)对于每个链接在链接中'设置 doc = NewHTMLDocument(CStr(link))设置 Html = NewHTMLDocument(CStr(link))使用 regxp''' ########## 电话号码模式###########.Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{4})\)?[-.]?([0-9]{4})[-.]?([0-9]{3}?)"'"(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{3}?)".Global = 假.IgnoreCase = 真设置 phone_list = .Execute(Html.body.innerHtml)''' ########## 电子邮件模式 ###########.Pattern = "([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)".Global = 假.IgnoreCase = 真设置 email_list = .Execute(Html.body.innerHtml)'''########## 电话列表############# 添加到工作表出错时继续下一步如果 phone_list(0) 什么都没有,那么出错时继续下一步Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = "-";别的出错时继续下一步Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, B").End(xlUp).Row + 1, B").Value = phone_list(0)万一'''########## 电子邮件列表############# 添加到工作表出错时继续下一步如果 email_list(0) 什么都没有,那么出错时继续下一步Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, C").End(xlUp).Row + 1, C").Value = -"别的出错时继续下一步Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, C").End(xlUp).Row + 1, C").Value = email_list(0)万一结束于''''导航链接下一个链接结束子公共函数 NewHTMLDocument(strURL As String) As ObjectDim objHTTP As Object, objHTML As Object, strTemp As String设置 objHTTP = CreateObject("MSXML2.ServerXMLHTTP")objHTTP.Open "GET", strURL, FalseobjHTTP.send如果 objHTTP.Status = 200 那么strTemp = objHTTP.responseTextSet objHTML = CreateObject("htmlfile")objHTML.body.innerHtml = strTemp设置 NewHTMLDocument = objHTML别的'出现了错误万一结束函数

解决方案

这是您可以使用宏遍历所有要遍历的预定义链接以收集电子邮件和电话号码的方法.要确定您的模式是否可以找到任何东西,您需要使用 .Count 属性,如下所示.您可以随时替换我在下面使用的模式,因为它们与您的主要问题无关.

Sub GetEmailAndPhone()Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")Dim Rxp As Object: Set Rxp = CreateObject("VBScript.RegExp")Dim emailMatch As Object, phoneMatch As Object, S$, cel As RangeDim Html 作为 HTMLDocument对于 Sheets 中的每个 cel(Sheet1").Range(A2:A"& Sheets(Sheet1").Cells(Rows.Count, 1).End(xlUp).Row)用 IE.可见=假.navigate cel而 .Busy 或 .readyState <>4:DoEvents:温德设置 Html = .document结束于与接收.Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+";设置 emailMatch = .Execute(Html.body.innerHTML).Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-.]?([0-9]{3})[-.]?([0-9]{4})"设置 phoneMatch = .Execute(Html.body.innerHTML)结束于如果 emailMatch.Count >= 1 那么cel(1, 2) = emailMatch(0)别的:cel(1, 2) = 未找到"万一如果 phoneMatch.Count >= 1 那么cel(1, 3) = phoneMatch(0)别的:cel(1, 3) = 未找到"万一下一个单元格结束子

What I am trying to build

In Sheet1 column A row2 down to x amount of rows, there will be a list of website URLs. I need the code to go through the urls and find the phone numbers and emails and place them in column B + C next to the urls, if nothing is found place a hyphen in the cell.

I have almost got this working. The code loops through a list of URLS in Sheet1 column A and pulls the phone numbers and emails, places them into column B and C. I just have 3 problems with the current code i wrote, these are stated below Problem 3 might be a simple fix.

THE NEW CODE

Private Sub CommandButton1_Click()
' Run main code
Dim wb As Workbook
Dim wsSheet As Worksheet, links As Variant, IE As Object, link As Variant
Dim rw As Long
Dim html As New HTMLDocument
Dim regxp As New RegExp, post As Object, phone_list As Object, email_list As Object

'SHEET1 as sheet with URL
    Set wb = ThisWorkbook
    Set wsSheet = wb.Sheets("Sheet1")

    'Set IE = InternetExplorer
    Set IE = CreateObject("InternetExplorer.Application")

    rw = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row
    links = wsSheet.Range("A2:A" & rw)

    'IE Open Time per page 4sec and check links on Sheet2 Column A
    With IE
       .Visible = True
       Application.Wait (Now + TimeValue("00:00:04"))

       For Each link In links
           .navigate (link)
           While .Busy Or .readyState <> 4: DoEvents: Wend

Set html = .document

'Application.Wait (Now + TimeValue("00:00:04"))
  With regxp
        .Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})"
        Set phone_list = .Execute(html.body.innerHTML)
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
        Set email_list = .Execute(html.body.innerHTML)
    End With
   
    Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)
    Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)
 ''''################################ I TRIED TO FIX THE PROBLEM WITH THIS  #########################
 '''' ############################### TO PLACE A HYPHEN IF NOTHING IS FOUND #########################
''''      If regxp Is Nothing Then
''''            wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list()
''''        Else
''''             wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)
''''        End If
''''
''''        If regxp Is Nothing Then
''''            wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list()
''''        Else
''''            wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)
''''        End If
''''################################ I TRIED TO FIX THE PROBLEM WITH THIS  #########################
''''################################################################################################

'navigate links
      Next link

'Close IE Browser
    .Quit
    End With

    Set IE = Nothing
End Sub

Problem 1

If there is no item to extract then the code does not go to the next url, for some reason it just stay on that page, or I get an error message. e.g website has phone number but no email the page will not navigate to the next url. I tried to fix this with an IF statement but could not get it to work.. What it should do If there is nothing to extract go to the next urls in column A

Problem 2

If the website has an invalid security certificate or the url is DEAD then the code does not navigate to the next url, it waits for a user input. If I click "NO" to state I do not wish to to navigate to this site the code crashes. If the certificate is invalid or url is DEAD then it should move to the next url, so if site has not loaded in X amount of time move to the next url. Not sure if this could also be used for problem1

I think I need something like this, but can't work it out with my code Mr Excel

Problem 3

This might just be an excel column formating issue unless I have have got the phone number expression wrong in the code. As you can see the phone numbers are not showing correct. I am not sure if excel is clearing an "0" and that is why the numbers are wrong or the phone number expression is wrong.

Thanks for having a look Please could sombody help me out on anyone of the three issues. As aways THANKS in advance.

UPDATED TODAY 24/7/2020 AT 12:56 UK TIME

I have added a better Regxp for finding phone numbers, since posting PROBLEM 3, it has improved a bit .Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})" it is still however clipping some digits of the numbers, see image below, numbers in green where found and last digit is missing

Also Posted on Mr Excel Mr Excel.

####### Added Today Thursday 30th July 2020 4:00pm Uk time ########

I am trying an If statement so If Nothing is found then place a hyphen, see below

If email_list Is Nothing Then
'On Error Resume Next
wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "c").End(xlUp).Row + 1, "c").Value = "-"
Else
On Error Resume Next
wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "c").End(xlUp).Row + 1, "c").Value = email_list(0)
End If
End With

However I can Not get it to work, the On Error Resume Next allows me to move to the next url and gets rid of the first error message.

The emails NOW pull off as such, I have colour coded them for easy viewing. As you can see from the colours they are NOT next to the correct urls, This is why I was trying to place an hyphen in the cell at least then that cell would be poplulated an the date would go into the next blank cell, thus keeping everything in line.

############## Updated Today FRIDAY 31st JULY 2020 1.26PM UK time

I have fixed the problem with data NOT going into the right place, By uking the IF statement code. So now Problems 1 and 2 seem fine. Only problem 3 remains, which I though would be a simple fix LOL.

The problem was this

If regxp Is Nothing Then

It should have been

If Phone_List (0) Is Nothing Then

And

If Email_List (0) Is Nothing Then

########### UPDATED TODAY Monday 3rd August 11.45 Uk time #############

This is my workaround to overcome Problem 3 for phone numbers not pulling of correct.

I have changed the Pattern part of the code, so now it pulls the REGXP pattern from the Sheet, Sheet1.Range D1. This way I can change the regxp pattern in the cell to pull off different phone number types.

''' ########## Phone Numbers Pattern ###########
        .Pattern = ThisWorkbook.Sheets("Sheet1").Range("D1")
        .Global = False
        .IgnoreCase = True
        Set phone_list = .Execute(html.body.innerHtml)

This is the Regxp pattern I am using for now, for uk. It is in placed in Sheet1 CELL D1

(?:\+1)?(?:\+[0-9])?\(?([0-9]{4})\)?[-. ]?([0-9]{4})[-. ]?([0-9]{3}?)

If anyone has a better pattern please post.

########## Updated today Tuesday 5th August 2020 1:35 uk time ##########

I have MSXML2.ServerXMLHTTP code which works much faster, but misses a few emails and numbers. Where as the IE version I wrote and the code ANSWER written by SMTH pick up the extra emails and phone numbers. I changed the regxp patters in SMTH ANSWER to mine for better results.

If anyone knows why then please advise, otherwise SMTH code is the answer as it does the same job as mine, but is written much better.

Private Sub CommandButton2_Click()
'''######### NO IE THIS CODE IS FASTER ######
Dim wb As Workbook
Dim wsSheet As Worksheet, links As Variant, link As Variant
Dim rw As Long
Dim regxp As New RegExp, post As Object, phone_list As Object, email_list As Object
Dim Html As New HTMLDocument

''''SHEET1 as sheet with URL
    Set wb = ThisWorkbook
    Set wsSheet = wb.Sheets("Sheet1")
   
    rw = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row
    links = wsSheet.Range("A2:A" & rw)

For Each link In links
    'Set doc = NewHTMLDocument(CStr(link))
      Set Html = NewHTMLDocument(CStr(link))
With regxp
''' ########## Phone Numbers Pattern ###########
        .Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{4})\)?[-. ]?([0-9]{4})[-. ]?([0-9]{3}?)" '"(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{3}?)"
        .Global = False
        .IgnoreCase = True
        Set phone_list = .Execute(Html.body.innerHtml)
''' ########## Email Pattern ###########
        .Pattern = "([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)"
        .Global = False
       .IgnoreCase = True
        Set email_list = .Execute(Html.body.innerHtml)
    

'''########## PHONE LIST ############# ADD TO SHEET
On Error Resume Next
    If phone_list(0) Is Nothing Then
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = "-"
    Else
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)
    End If
'''########## EMAIL LIST ############# ADD TO SHEET
On Error Resume Next
      If email_list(0) Is Nothing Then
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = "-"
    Else
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)
    End If
End With
''''navigate links
     Next link

End Sub

Public Function NewHTMLDocument(strURL As String) As Object
    Dim objHTTP As Object, objHTML As Object, strTemp As String
    Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    objHTTP.Open "GET", strURL, False
    objHTTP.send
    If objHTTP.Status = 200 Then
        strTemp = objHTTP.responseText
    Set objHTML = CreateObject("htmlfile")
    objHTML.body.innerHtml = strTemp
    Set NewHTMLDocument = objHTML
Else
   'There has been an error
End If
End Function

解决方案

This is how you can loop through all the predefined links you wanna traverse using macro in order to collect email and phone numbers. To figure out whether your patterns could find anything, you wanna use .Count property like I've shown below. You can always replace the patterns I've used below as they are not relevant to your major issues.

Sub GetEmailAndPhone()
    Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
    Dim Rxp As Object: Set Rxp = CreateObject("VBScript.RegExp")
    Dim emailMatch As Object, phoneMatch  As Object, S$, cel As Range
    Dim Html As HTMLDocument

    For Each cel In Sheets("Sheet1").Range("A2:A" & Sheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Row)
        With IE
            .Visible = False
            .navigate cel
            While .Busy Or .readyState <> 4: DoEvents: Wend
            Set Html = .document
        End With
        
        With Rxp
            .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
            Set emailMatch = .Execute(Html.body.innerHTML)
            .Pattern = "(?:\+1)?(?:\+[0-9])?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})"
            Set phoneMatch = .Execute(Html.body.innerHTML)
        End With
        
        If emailMatch.Count >= 1 Then
            cel(1, 2) = emailMatch(0)
        Else:
            cel(1, 2) = "Not Found"
        End If
        
        If phoneMatch.Count >= 1 Then
            cel(1, 3) = phoneMatch(0)
        Else:
            cel(1, 3) = "Not Found"
        End If
    Next cel
End Sub

这篇关于来自网址列表的电子邮件抓取工具的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆