批量电子邮件提取器.需要添加Regxp-Excel VBA [英] Bulk Email Extractor. Need to add a Regxp - excel VBA

查看:28
本文介绍了批量电子邮件提取器.需要添加Regxp-Excel VBA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试用第二个代码的一部分更新我的第一个代码,但是现在已经挣扎了几周.两种代码都从网址列表中提取电子邮件.

I am trying to update my first code with parts of a second code, but have been struggling for a few weeks now. Both codes extract emails from a url list.

第一密码

除了使用 Mailto:而不是 Regxp ,此代码还不错.我正在尝试用第二个代码中的 Regxp 替换第一个代码中的Mailto :,因为regxp提取了更多电子邮件.

This code is fine apart from it uses Mailto: rather than a Regxp. I am trying to replace the Mailto: in the first code with a Regxp from the second code as the regxp extracts more emails.

第一个代码的工作方式.

网址列表放置在 Sheet2网址" 中,结果显示在 Sheet1结果" 中.此代码将从网站中提取 ALL 电子邮件,因此,如果有10封电子邮件,它将提取所有10封电子邮件;如果有100封电子邮件,那么它将提取所有100封电子邮件.SECOND代码仅每个站点提取 1封电子邮件.

A list of urls are placed in Sheet2 "Urls" and the results are show in Sheet1 "Results". This code will extract ALL emails from the site, so if there are 10 emails it will extract ALL 10, if 100 then it will extract All 100 EMAIL. The SECOND code only extracts 1 email per site.

FIRST CODE 的问题是 Mailto :会发送很多电子邮件,而 REGXP 会捕获更多电子邮件,依此类推我想更新第一个代码.但是,由于我对编写代码并不热衷,所以我已经苦苦挣扎了几个星期.第一个代码最初是由我编写的,然后由另一位开发人员进行了更新,而他的代码超出了我的技术水平.我已经尝试了数周以进行更新,但似乎无法解决任何问题,因此我决定发布.

The problem with the FIRST CODE is that the Mailto: does MISS a lot of email where as the REGXP captures more, and so I want to update the first code. However I have been struggling for a few weeks now as I am not super hot in writing code. The first code was originaly written by me and then updated by another developer and his code is wayout of my skill depth. I have been trying for weeks to update it but can not seem to work anything out so I decided to post.

  1. 我在下面列出了第一个代码.您可以从此处下载链接优先代码

链接到我也编写的第二个代码以及我尝试使用的EMAIL REGXP,

Link to my second code that I also wrote, and the EMAIL REGXP i am trying to use, My Post StackOver Flow

第一密码,群发电子邮件提取器.仅电子邮件部分需要使用REGXP更新.

Sub ScrapeSoMeAndMailAddresses()
'Columns for both tables
Const colUrl As Long = 1 'Must always be the first column
Const colMail As Long = 2 'Must always be the first column before Some platforms
Const colFacebook As Long = 3 'Must always be the last column of Some platforms
Const colError As Long = 4 'Must always be the last column

Dim url As String
Dim http As Object
Dim htmlDoc As Object
Dim nodeAllLinks As Object
Dim nodeOneLink As Object
Dim pageLoadSuccessful As Boolean
Dim tableUrlsOneAddressLeft As String
Dim tableAllAddresses As String
Dim currentRowTableUrls As Long
Dim lastRowTableUrls As Long
Dim currentRowsTableAll(colUrl To colFacebook) As Long
Dim lastRowTableAll As Long
Dim addressCounters(colMail To colFacebook) As Long
Dim checkCounters As Long

  'Initialize variables
  tableUrlsOneAddressLeft = "Urls" ''Name of Sheet
  currentRowTableUrls = 2 'First row for content
  tableAllAddresses = "Results" ''Name of Sheet
  For checkCounters = colUrl To colFacebook
    currentRowsTableAll(checkCounters) = 2 'First rows for content
  Next checkCounters
  Set htmlDoc = CreateObject("htmlfile")
  Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
  
'Clear all contents and comments in the URL source sheet from email column to error column
  With Sheets(tableUrlsOneAddressLeft)
        lastRowTableUrls = .Cells(Rows.Count, colUrl).End(xlUp).Row
        .Range(.Cells(currentRowTableUrls, colMail), .Cells(lastRowTableUrls, colError)).ClearContents
        .Range(.Cells(currentRowTableUrls, colMail), .Cells(lastRowTableUrls, colError)).ClearComments
  End With
  
'Delete all rows except headline in the sheet with all addresses
        lastRowTableAll = Sheets(tableAllAddresses).Cells(Rows.Count, colUrl).End(xlUp).Row
        Sheets(tableAllAddresses).Rows(currentRowsTableAll(colUrl) & ":" & lastRowTableAll).Delete Shift:=xlUp
  
'Loop over all URLs in column A in the URL source sheet
Do While Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, 1).Value <> ""
    'Scroll for visual monitoring, if 'the sheet with the URLs are the
    'active one
    If ActiveSheet.Name = tableUrlsOneAddressLeft Then
        If currentRowTableUrls > 14 Then
            ActiveWindow.SmallScroll down:=1
    End If
        Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, 1).Select
    End If
    
'Get next url from the URL source sheet
    url = Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colUrl).Value
    
'Try to load page 'Temporarily disable error handling if 'there is a timeout or onother error
    On Error Resume Next
    http.Open "GET", url, False
    http.send
    
'Check if page loading was successful
    If Err.Number = 0 Then
         pageLoadSuccessful = True
    End If
    On Error GoTo 0
    
    If pageLoadSuccessful Then
        'Build html document for DOM operations
        htmlDoc.body.innerHtml = http.responseText
        'Create node list from all links of the page
        Set nodeAllLinks = htmlDoc.getElementsByTagName("a")
        'Walk through all links of the node list
      
    For Each nodeOneLink In nodeAllLinks
    
    
'''#####################################################################################################
'''################################### THIS IS THE START OF THE EMAIL SECTION ##########################
'''#####################################################################################################
'Check for mail address
    If InStr(1, nodeOneLink.href, "mailto:") Then
            'Write mail address to both tables
            Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colMail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
            Sheets(tableAllAddresses).Cells(currentRowsTableAll(colMail), colMail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
            'Check if it is a new line in the sheet with all addresses
    If currentRowsTableAll(colMail) >= currentRowsTableAll(colUrl) Then
            'Write URL in the new line of the sheet with all addresses
            Sheets(tableAllAddresses).Cells(currentRowsTableAll(colUrl), colUrl).Value = url
            'Increment url counter
            currentRowsTableAll(colUrl) = currentRowsTableAll(colUrl) + 1
    End If
            'Increment mail counters
             currentRowsTableAll(colMail) = currentRowsTableAll(colMail) + 1
            addressCounters(colMail) = addressCounters(colMail) + 1
    End If
    
'''#####################################################################################################
'''################################### END OF THE EMAIL SECTION ########################################
'''#####################################################################################################



  'Check for Facebook address
        If InStr(1, UCase(nodeOneLink.href), "FACEBOOK") Then
          'Write Facebook address to both tables
          Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colFacebook).Value = nodeOneLink.href
          Sheets(tableAllAddresses).Cells(currentRowsTableAll(colFacebook), colFacebook).Value = nodeOneLink.href
          'Check if it is a new line in the sheet with all addresses
          If currentRowsTableAll(colFacebook) >= currentRowsTableAll(colUrl) Then
            'Write URL in the new line of the sheet with all addresses
            Sheets(tableAllAddresses).Cells(currentRowsTableAll(colUrl), colUrl).Value = url
            'Increment url counter
            currentRowsTableAll(colUrl) = currentRowsTableAll(colUrl) + 1
          End If
          'Increment Facebook counters
          currentRowsTableAll(colFacebook) = currentRowsTableAll(colFacebook) + 1
          addressCounters(colFacebook) = addressCounters(colFacebook) + 1

 End If

Next nodeOneLink

'Check address counters
    For checkCounters = colMail To colFacebook
        'Set comment if more than 1 link were found
        If addressCounters(checkCounters) > 1 Then
          Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, checkCounters).AddComment Text:=CStr(addressCounters(checkCounters))
          Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, checkCounters).Comment.Shape.TextFrame.AutoSize = True
        End If
    Next checkCounters
    Else
      'Page not loaded
      'Write message URL table
      Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colError).Value = "Error with URL or timeout"
    End If
    
'Prepare for next page
    pageLoadSuccessful = False
    Erase addressCounters
    lastRowTableAll = Sheets(tableAllAddresses).Cells(Rows.Count, colUrl).End(xlUp).Row
    For checkCounters = colUrl To colFacebook
      currentRowsTableAll(checkCounters) = lastRowTableAll + 1 'First rows for next page content
    Next checkCounters
    currentRowTableUrls = currentRowTableUrls + 1
  Loop
  
'Clean up
  Set http = Nothing
  Set htmlDoc = Nothing
  Set nodeAllLinks = Nothing
  Set nodeOneLink = Nothing

End Sub

一如既往的感谢.

推荐答案

我建议您先保存工作簿的副本,然后再运行答案中的代码.希望它能正常工作并满足您的需求,但我无法对其进行全面测试.

I recommend you save a copy of your workbook before running the code in my answer. Hopefully it works and does what you need it to, I wasn't able to test it fully.

  • 下面代码中的 GetEmailAddressesFromHtml 函数使用您包含在答案中的正则表达式提取电子邮件地址.

  • The GetEmailAddressesFromHtml function in the code below extracts email addresses using the regular expression you've included in your answer.

我认为以下代码没有做的一件事(您的代码做了)是在 Urls 工作表中添加注释.但是我认为您可以设置一个Excel公式(例如 COUNTIFS )来完成该操作.

I think one thing the code below doesn't do (which your code did) was add comments to the Urls worksheet. But you could set up an Excel formula (e.g. COUNTIFS) to do that in my opinion.

我仅添加唯一的Facebook URL和电子邮件地址,因此您不会在结果工作表上看到任何重复项.

I only add unique Facebook URLs and email addresses, so you shouldn't see any duplicates on the Results sheet.

Option Explicit
Option Private Module 'This option means that Subs of this module are not displayed in the macros of the Excel GUI

' https://docs.microsoft.com/en-us/office/vba/language/reference/user-interface-help/raise-method
Private Const ERR_REQUEST_FAILED As Long = 513
Private Const ERR_INVALID_HTML As Long = 514


Private Sub PrepareSourceSheet( _
        ByVal someSheet As Worksheet, ByVal firstRowToClear As Long, ByVal lastRowToClear As Long, _
        ByVal firstColumnToClear As Long, ByVal lastColumnToClear As Long)
    ' Should clear all contents and comments in the source sheet from email column to error column
        
    With someSheet
        With .Range(.Cells(firstRowToClear, firstColumnToClear), .Cells(lastRowToClear, lastColumnToClear))
            Debug.Assert Intersect(.Cells, .Parent.Columns(1)) Is Nothing
            .ClearContents
            .ClearComments
        End With
    End With
End Sub


Private Sub PrepareDestinationSheet(ByVal someSheet As Worksheet, ByVal firstRowToDelete As Long)
    'Should delete all rows starting from and including "firstRowToDelete".
    With someSheet
        .rows(firstRowToDelete & ":" & .rows.CountLarge).Delete Shift:=xlUp
    End With
End Sub


Private Sub ScrapeSomeData()
    'Columns for both tables
    Const COLUMN_URL As Long = 1 'Must always be the first column
    Const COLUMN_EMAIL As Long = 2 'Must always be the first column before Some platforms
    Const COLUMN_FACEBOOK As Long = 3 'Must always be the last column of Some platforms
    Const COLUMN_ERROR As Long = 4 'Must always be the last column
    
    Const FIRST_SOURCE_ROW As Long = 2 ' Skip headers
    Const FIRST_DESTINATION_ROW As Long = 2 ' Skip headers
    
    Dim sourceSheet As Worksheet
    Set sourceSheet = ThisWorkbook.Worksheets("Urls")
    
    Dim destinationSheet As Worksheet
    Set destinationSheet = ThisWorkbook.Worksheets("Results")
    
    Dim lastSourceRow As Long
    lastSourceRow = sourceSheet.Cells(sourceSheet.rows.Count, COLUMN_URL).End(xlUp).Row
    
    PrepareSourceSheet someSheet:=sourceSheet, firstRowToClear:=FIRST_SOURCE_ROW, _
            lastRowToClear:=lastSourceRow, firstColumnToClear:=COLUMN_EMAIL, lastColumnToClear:=COLUMN_ERROR
    PrepareDestinationSheet someSheet:=destinationSheet, firstRowToDelete:=FIRST_DESTINATION_ROW
    
    Dim destinationRowIndex As Long
    destinationRowIndex = FIRST_DESTINATION_ROW
    
    Dim sourceRowIndex As Long
    For sourceRowIndex = FIRST_SOURCE_ROW To lastSourceRow

        Dim data As Collection
        Set data = GetDataForUrl(sourceSheet.Cells(sourceRowIndex, COLUMN_URL))
        
        With destinationSheet
            Dim currentRowData As Variant
            For Each currentRowData In data
                .Cells(destinationRowIndex, COLUMN_URL).Value = currentRowData("url")
                .Cells(destinationRowIndex, COLUMN_EMAIL).Value = currentRowData("emailAddress")
                .Cells(destinationRowIndex, COLUMN_FACEBOOK).Value = currentRowData("facebookUrl")
                .Cells(destinationRowIndex, COLUMN_ERROR).Value = currentRowData("errorMessage")
                destinationRowIndex = destinationRowIndex + 1
            Next currentRowData
        End With
        
        With sourceSheet
            .Cells(sourceRowIndex, COLUMN_EMAIL).Value = data(1)("emailAddress")
            .Cells(sourceRowIndex, COLUMN_FACEBOOK).Value = data(1)("facebookUrl")
            .Cells(sourceRowIndex, COLUMN_ERROR).Value = data(1)("errorMessage")
        End With
        
        DoEvents
    Next sourceRowIndex
End Sub


Private Function GetHtmlFromUrl(ByVal someUrl As String) As Object
    ' Should return a HTML document. Raises an error if URL is unavailable
    ' (at the time of requesting) or if HTML could not be assigned.
    
    Dim httpClient As Object
    Set httpClient = CreateObject("MSXML2.ServerXMLHTTP.6.0")

    Dim succeeded As Boolean
    
    httpClient.Open "GET", someUrl, False
    On Error Resume Next
    httpClient.Send
    succeeded = (0 = Err.Number)
    On Error GoTo 0

    If Not succeeded Then Err.Raise ERR_REQUEST_FAILED, , "Error with URL or timeout"
    
    Dim htmlDocument As Object
    Set htmlDocument = CreateObject("htmlfile")
    
    On Error Resume Next
    htmlDocument.body.innerHTML = httpClient.responseText
    succeeded = (0 = Err.Number)
    On Error GoTo 0
    
    If Not succeeded Then Err.Raise ERR_INVALID_HTML, , "Error whilst assigning HTML"
    
    Set GetHtmlFromUrl = htmlDocument
End Function


Private Function GetFacebookUrlsFromHtml(ByVal htmlDocument As Object) As Collection
    ' Should return a collection of strings that are Facebook URLs detected.
    ' This function only looks within anchor tags.
    Dim outputCollection As Collection
    Set outputCollection = New Collection
    
    Dim allAnchorTags As Object
    Set allAnchorTags = htmlDocument.getElementsByTagName("a")
    
    Dim anchorTag As Object
    For Each anchorTag In allAnchorTags
        If InStr(1, UCase$(anchorTag.href), "FACEBOOK", vbBinaryCompare) > 0 Then
            On Error Resume Next
            outputCollection.Add anchorTag.href, Key:=anchorTag.href ' De-duplicate here
            On Error GoTo 0
        End If
    Next anchorTag
    
    Set GetFacebookUrlsFromHtml = outputCollection
End Function


Private Function GetEmailAddressesFromHtml(ByVal htmlDocument As Object) As Collection
    ' Should return a collection of strings representing email addresses detected
    ' in the HTML document.
    
    Dim outputCollection As Collection
    Set outputCollection = New Collection
    
    Dim regEx As Object
    Set regEx = CreateObject("VBScript.RegExp")
    
    With regEx
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
        .Global = True
        Dim emailMatches As Object
        Set emailMatches = .Execute(htmlDocument.body.innerHTML)
    End With
    
    Dim matchFound As Object
    For Each matchFound In emailMatches
        On Error Resume Next ' De-duplicate here.
        outputCollection.Add matchFound.Value, Key:=matchFound.Value
        On Error GoTo 0
    Next matchFound
    
    Set GetEmailAddressesFromHtml = outputCollection
End Function


Private Function CreateRowDataForError(ByVal wasSuccess, ByVal errorMessage, ByVal someUrl As String) As Collection
    ' Context: An error has occurred and we don't have any data (Facebook URLs, email addresses).
    ' So can only return the URL attempted and the error message.
    
    Dim nestedCollection As Collection
    Set nestedCollection = New Collection
    
    nestedCollection.Add wasSuccess, "wasSuccess"
    nestedCollection.Add errorMessage, "errorMessage"
    nestedCollection.Add someUrl, "url"
    nestedCollection.Add vbNullString, "facebookUrl"
    nestedCollection.Add vbNullString, "emailAddress"
    
    Set CreateRowDataForError = New Collection
    CreateRowDataForError.Add nestedCollection
    
    Debug.Assert 1 = CreateRowDataForError.Count
End Function


Private Function CreateRowDataForResults(ByVal wasSuccess As Boolean, ByVal errorMessage As String, _
        ByVal someUrl As String, ByVal facebookUrls As Collection, ByVal emailAddresses As Collection) As Collection
    ' Context: No error occurred. HTML document may or may not contain data,
    ' but logic below should handle both scenarios.
    Dim nestedCollection As Collection
    
    Dim outerCollection As Collection
    Set outerCollection = New Collection
    
    Dim i As Long
    For i = 1 To Application.Max(1, facebookUrls.Count, emailAddresses.Count)
        Set nestedCollection = New Collection
        nestedCollection.Add wasSuccess, Key:="wasSuccess"
        nestedCollection.Add errorMessage, Key:="errorMessage"
        nestedCollection.Add someUrl, Key:="url"
        nestedCollection.Add GetCollectionItemOrDefault(facebookUrls, i, vbNullString), Key:="facebookUrl"
        nestedCollection.Add GetCollectionItemOrDefault(emailAddresses, i, vbNullString), Key:="emailAddress"
        outerCollection.Add nestedCollection
    Next i
    
    Debug.Assert outerCollection.Count = Application.Max(1, facebookUrls.Count, emailAddresses.Count)
    
    Set CreateRowDataForResults = outerCollection
End Function


Private Function GetDataForUrl(ByVal someUrl As String) As Collection
    ' Currently this function misuses Collection class. Should probably instead write a class to return a custom object/data structure.
    ' Returns a collection of nested collections, where each nested collection is as below:
    '   • "wasSuccess" = whether data was successfully retrieved
    '   • "errorMessage" = an error message mentioning what happened
    '   • "facebookUrl" = a Facebook URL detected
    '   • "emailAddress" = an email address detected
    
    Dim wasSuccess As Boolean
    Dim errorMessage As String
    Dim htmlDocument As Object
    
    On Error Resume Next
    Set htmlDocument = GetHtmlFromUrl(someUrl)
    
    wasSuccess = (0 = Err.Number)
    
    If Not wasSuccess Then
        errorMessage = IIf(ERR_REQUEST_FAILED = Err.Number Or ERR_INVALID_HTML = Err.Number, Err.Description, "Unexpected error occurred")
    End If
    On Error GoTo 0

    If Not wasSuccess Then
        Set GetDataForUrl = CreateRowDataForError(wasSuccess, errorMessage, someUrl)
    Else
        Dim facebookUrls As Collection
        Set facebookUrls = GetFacebookUrlsFromHtml(htmlDocument)
        
        Dim emailAddresses As Collection
        Set emailAddresses = GetEmailAddressesFromHtml(htmlDocument)
        
        Set GetDataForUrl = CreateRowDataForResults(wasSuccess, errorMessage, someUrl, facebookUrls, emailAddresses)
    End If

End Function


Private Function GetCollectionItemOrDefault(ByVal someCollection As Collection, ByVal someKey As Variant, ByVal someDefaultValue As Variant) As Variant
    ' Assumes item is not an object. This function will return false negatives if item being retrieved is an object.
    Dim succeeded As Boolean

    On Error Resume Next
    GetCollectionItemOrDefault = someCollection(someKey)
    succeeded = (0 = Err.Number)
    On Error GoTo 0
    
    If Not succeeded Then GetCollectionItemOrDefault = someDefaultValue
End Function


从维护和编码的角度来看,我认为像Node.js或Python这样的工具可以使您以更少的代码行完成相同的工作.


From a maintenance and coding perspective, I think something like Node.js or Python would allow you to get the same work done in fewer lines of code.

这篇关于批量电子邮件提取器.需要添加Regxp-Excel VBA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆