数据表的 Excel VBA 网页抓取 [英] Excel VBA web scraping for data table

查看:39
本文介绍了数据表的 Excel VBA 网页抓取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从名册资源中获取数据,这是一个网页示例 (https://www.rosterresource.com/mlb-arizona-diamondbacks).至少,我想获得预计的首选"首发阵容,并将该数据导入到我的电子表格中.然后,我会为 Roster Resource 中的每个 MLB 球队执行此操作,以创建包含每个团队和每个团队的预计阵容的表格.

I am trying to get data from Roster Resource, here's an example of a webpage (https://www.rosterresource.com/mlb-arizona-diamondbacks). At the very minimum, I want to get the "Projected "Go-to" Starting Lineup" and import that data into my spreadsheet. I would then do this for every MLB team from Roster Resource to create a sheet that has every team and the projected lineup for each team.

我尝试了getElementById"和getElementsByClassName"的一些方法,但是我很难获得我想要的数据,因为这似乎只是网页上的一个非常大的表格.

I have tried some methods of "getElementById" and "getElementsByClassName", but I'm having difficulty getting to the data I want since this seems to be just one very large table on the webpage.

任何让我了解获取数据的正确方向的见解都会非常有帮助.

Any insight to get me on the right direction of getting the data would be very helpful.

推荐答案

如果您浏览网页 https://www.rosterresource.com/mlb-arizona-diamondbacks 并从表格的上下文菜单中选择Inspect element,您将在浏览器开发人员工具中看到整个表格所在的位置在一个框架内:

If you navigate the webpage https://www.rosterresource.com/mlb-arizona-diamondbacks and choose Inspect element from context menu on the table, you will see in browser developer tools that the whole table is located within a frame:

<iframe id="pageswitcher-content" frameborder="0" marginheight="0" marginwidth="0" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSe6YBd7UW_ijhVHdRsM132Z3aUXUIzGuHcoqqdsr_nUXIYHbvRDFY0XCwGVndXJnWRaWVYhbeDbo5W/pubhtml/sheet?headers=false&amp;gid=1569103012" style="display: block; width: 100%; height: 100%;"></iframe>

因此实际上您需要从该 Google 电子表格文档中检索数据.这可以通过 XHR 和 Regex 来完成,如下面的代码所示:

So actually you need to retrieve the data from that Google Spreadsheet document. That could be done with XHR and Regex, as shown in the below code:

Option Explicit

Sub Test()

    Dim sContent As String
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim aTables()
    Dim aHeader() As String
    Dim aRows() As String

    ' Retrieve HTML content via XHR
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.rosterresource.com/mlb-arizona-diamondbacks", False
        .Send
        sContent = .ResponseText
    End With
    ' Cut all before iframe URL
    sContent = Split(sContent, "<iframe src=""", 2)(1)
    ' Cut all after ? sign within URL
    sContent = Split(sContent, "?", 2)(0)
    ' Download google spreadsheet by extracted URL
    ' e. g. https://docs.google.com/spreadsheets/d/e/2PACX-1vQngsjnOpqkD8FQIOLn4cFayZTe4dl5VJZLNjMzji2Iq0dVXan7nj20Pq6oKnVS_HFla9e5GUtCyYl_/pubhtml
    ' e. g. https://docs.google.com/spreadsheets/d/e/2PACX-1vSe6YBd7UW_ijhVHdRsM132Z3aUXUIzGuHcoqqdsr_nUXIYHbvRDFY0XCwGVndXJnWRaWVYhbeDbo5W/pubhtml
    ' Retrieve HTML content via XHR
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", sContent, False
        .Send
        sContent = .ResponseText
    End With
    ' Parse with RegEx
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        ' Process all tables within iframe content
        .Pattern = "<table[sS]*?>([sS]*?)</table>"
        With .Execute(sContent)
            ReDim aTables(0 To .Count - 1)
            For i = 0 To .Count - 1
                aTables(i) = .Item(i).SubMatches(0)
            Next
        End With
        For k = 0 To UBound(aTables)
            ' Minor HTML simplification
            sContent = aTables(k)
            ' Remove all tags except table formatting
            .Pattern = "<(?!/td|/tr|/th|(?:td|tr|th))[^>]*>|
|
|	"
            sContent = .Replace(sContent, "")
            ' Remove tags attributes
            .Pattern = "<(w+)[^>]+>"
            sContent = .Replace(sContent, "<$1>")
            ' Replace th with td
            .Pattern = "<(/?)th>"
            sContent = .Replace(sContent, "<$1td>")
            ' Replace HTML entities &name; &#number; with chars
            .Pattern = "&(?:w+|#d+);"
            .Global = False
            Do
                With .Execute(sContent)
                    If .Count = 0 Then Exit Do
                    sContent = Replace(sContent, .Item(0), DecodeHTMLEntities(.Item(0)))
                End With
            Loop
            .Global = True
           ' Extract rows
            .Pattern = "<tr>((?:<td>.*?</td>)+)</tr>"
            With .Execute(sContent)
                ReDim aRows(0 To .Count - 1, 0)
                For i = 0 To .Count - 1
                    aRows(i, 0) = .Item(i).SubMatches(0)
                Next
            End With
            ' Extract cells
            .Pattern = "<td>(.*?)</td>"
            For i = 0 To UBound(aRows, 1)
                With .Execute(aRows(i, 0))
                    For j = 0 To .Count - 1
                        If UBound(aRows, 2) < j Then ReDim Preserve aRows(UBound(aRows, 1), j)
                        aRows(i, j) = Trim(.Item(j).SubMatches(0))
                        DoEvents
                    Next
                End With
            Next
            aTables(k) = aRows
        Next
    End With
    ' Output
    With ThisWorkbook
        ' Remove all existing worksheets
        Application.DisplayAlerts = False
        .Sheets.Add , .Sheets(.Sheets.Count)
        Do While .Sheets.Count > 1
            .Sheets(1).Delete
        Loop
        Application.DisplayAlerts = True
        ' Output each table to separate worksheet
        For k = 0 To UBound(aTables)
            If .Sheets.Count < (k + 1) Then .Sheets.Add , .Sheets(.Sheets.Count)
            With .Sheets(k + 1)
                .Cells.Delete
                Output2DArray .Cells(1, 1), aTables(k)
                .Columns.AutoFit
            End With
        Next
    End With

End Sub

Function DecodeHTMLEntities(sText As String) As String

    Static oHtmlfile As Object
    Static oDiv As Object

    If oHtmlfile Is Nothing Then
        Set oHtmlfile = CreateObject("htmlfile")
        oHtmlfile.Open
        Set oDiv = oHtmlfile.createElement("div")
    End If
    oDiv.innerHTML = sText
    DecodeHTMLEntities = oDiv.innerText

End Function

Sub Output2DArray(oDstRng As Range, aCells As Variant)

    With oDstRng
        .Parent.Select
        With .Resize( _
                UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                UBound(aCells, 2) - LBound(aCells, 2) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

通常不建议将 RegEx 用于 HTML 解析,因此有免责声明.在这种情况下处理的数据非常简单,这就是使用 RegEx 对其进行解析的原因.关于 RegEx:介绍(特别是语法)、JS 介绍VB 风格.简化使得 HTML 代码在某种程度上适合解析.顺便说一句使用相同的方法还有一个答案.

Generally RegEx's aren't recommended for HTML parsing, so there is disclaimer. Data being processed in this case is quite simple that is why it is parsed with RegEx. About RegEx: introduction (especially syntax), introduction JS, VB flavor. Simplification makes HTML code suitable for parsing in some degree. BTW there is one more answer using the same approach.

这篇关于数据表的 Excel VBA 网页抓取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆