使用 VBA 进行网页抓取(当 HTML <> DOM 时) [英] Web Scraping with VBA (when HTML <> DOM)

查看:28
本文介绍了使用 VBA 进行网页抓取(当 HTML <> DOM 时)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在抓取这个特定网页的数据时遇到了一个可怕的时间...基本上,当我在浏览器中加载 URL 并手动按 F12 时,我可以在DOM 资源管理器"中看到我需要的信息,但是当我以编程方式尝试时做同样的事情(见下文),HTMLDoc 不包含我在DOM 资源管理器"中看到的相同信息......

公共子 testCode()将 IE 调暗为 SHDocVw.InternetExplorer将 HTMLDoc 变暗为 MSHTML.HTMLDocument设置 IE = 新的 SHDocVw.InternetExplorer用 IE.navigate "https://www.wunderground.com/cgi-bin/findweather/getForecast?query=EIDW"而 .Busy = True 或 .ReadyState <>READYSTATE_COMPLETE:温德设置 HTMLDoc = .Document结束于结束子

有人可以帮我访问DOM 资源管理器"中的信息吗?我知道 HTML 并不总是您在浏览器中看到的内容,而是创建您可以在浏览器中看到的内容的说明,但是必须有一种方法可以从 HTML 以编程方式创建 DOM...

此外,我相信我所追求的数据是由脚本或 iFrame 生成的,但我一直无法生成我正在寻找的数据,因为它们都被弄乱了......

更新

请参阅下面的 DOM Explorer 图片:

解决方案

大纲:

  • 将 GET XHR 发送到

    主要有6个部分,数据的相关部分被提取并输出到6个工作表(运行前必须手动创建):

    Sheet1 - 每日预测Sheet2 - Horly 预测Sheet3 - 响应数据(转置)Sheet4 - 当前数据(转置)Sheet5 - 天文学(转置)Sheet6 - 每小时历史数据

    有了这个例子,你就可以从那个 JSON 响应中提取你需要的数据.

    I have having a horrible time scraping this particular webpage's data... Basically I can see the information that I need in the 'DOM Explorer' when I load the URL in a browser and hit F12 manually, but when I programmatically attempt to do the same (see below) the HTMLDoc does not contain the same information that I can see in the 'DOM Explorer'...

    Public Sub testCode()
    
        Dim IE As SHDocVw.InternetExplorer
        Dim HTMLDoc As MSHTML.HTMLDocument
        Set IE = New SHDocVw.InternetExplorer
        With IE
            .navigate "https://www.wunderground.com/cgi-bin/findweather/getForecast?query=EIDW"
            While .Busy = True Or .ReadyState <> READYSTATE_COMPLETE: Wend
            Set HTMLDoc = .Document
        End With
    
    End Sub
    

    Can someone please help me access the information in the 'DOM Explorer'? I know that HTML is not always what you see in a browser, but rather the instructions to create what you can see in the browser, but then there must be a way to programmatically create the DOM from the HTML...

    Also, I believe that the data I am after is being generated by scripts or iFrames, but I have been unable to generate the data I am looking for from messing around with either....

    UPDATE

    See picture of DOM Explorer below:

    解决方案

    The outline:

    Actually web browser do almost the same stuff each time you open that webpage.

    You may use the below VBA code to parse response and output result. Import JSON.bas module into the VBA project for JSON processing.

    Sub TestScrapeWunderground()
    
        Dim sContent As String
        Dim sKey As String
        Dim sLocation As String
        Dim vJSON As Variant
        Dim sState As String
        Dim oDays As Object
        Dim oHours As Object
        Dim vDay As Variant
        Dim vHour As Variant
        Dim aRows() As Variant
        Dim aHeader() As Variant
    
        ' GET XHR to retrieve location and key
        With CreateObject("MSXML2.ServerXMLHTTP")
            .Open "GET", "https://www.wunderground.com/cgi-bin/findweather/getForecast?query=EIDW", False
            .Send
            sContent = .responseText
        End With
        ' Extract location and key from HTML content
        sLocation = Split(Split(sContent, "var query = 'zmw:' + '", 2)(1), "'", 2)(0)
        sKey = Split(Split(sContent, vbTab & "k: '", 2)(1), "'", 2)(0)
        ' GET XHR to retrieve JSON data
        With CreateObject("MSXML2.ServerXMLHTTP")
            .Open "GET", "https://api-ak-aws.wunderground.com/api/" & sKey & "/forecast10day/hourly10day/labels/conditions/astronomy10day/lang:en/units:metric/v:2.0/bestfct:1/q/zmw:" & sLocation & ".json", False
            .Send
            sContent = .responseText
        End With
        ' Parse JSON response to data structure
        JSON.Parse sContent, vJSON, sState
        ' Populate dictionaries with daily and hourly forecast data
        Set oDays = CreateObject("Scripting.Dictionary")
        Set oHours = CreateObject("Scripting.Dictionary")
        For Each vDay In vJSON("forecast")("days")
            oDays(vDay("summary")) = ""
            For Each vHour In vDay("hours")
                oHours(vHour) = ""
            Next
        Next
        ' Convert daily forecast data to arrays
        JSON.ToArray oDays.Keys(), aRows, aHeader
        ' Output daily forecast data to table
        With Sheets(1)
            .Cells.Delete
            OutputArray .Cells(1, 1), aHeader
            Output2DArray .Cells(2, 1), aRows
            .Columns.AutoFit
        End With
        ' Convert hourly forecast data to arrays
        JSON.ToArray oHours.Keys(), aRows, aHeader
        ' Output hourly forecast data to table
        With Sheets(2)
            .Cells.Delete
            OutputArray .Cells(1, 1), aHeader
            Output2DArray .Cells(2, 1), aRows
            .Columns.AutoFit
        End With
        ' Convert response data to arrays
        JSON.ToArray Array(vJSON("response")), aRows, aHeader
        ' Output response transposed data to table
        With Sheets(3)
            .Cells.Delete
            Output2DArray .Cells(1, 1), WorksheetFunction.Transpose(aHeader)
            Output2DArray .Cells(1, 2), WorksheetFunction.Transpose(aRows)
            .Columns.AutoFit
        End With
        ' Convert current data to arrays
        JSON.ToArray Array(vJSON("current_observation")), aRows, aHeader
        ' Output current transposed data to table
        With Sheets(4)
            .Cells.Delete
            Output2DArray .Cells(1, 1), WorksheetFunction.Transpose(aHeader)
            Output2DArray .Cells(1, 2), WorksheetFunction.Transpose(aRows)
            .Columns.AutoFit
        End With
        ' Populate dictionary with daily astronomy data
        Set oDays = CreateObject("Scripting.Dictionary")
        For Each vDay In vJSON("astronomy")("days")
            oDays(vDay) = ""
        Next
        ' Convert daily astronomy data to arrays
        JSON.ToArray oDays.Keys(), aRows, aHeader
        ' Output daily astronomy transposed data to table
        With Sheets(5)
            .Cells.Delete
            Output2DArray .Cells(1, 1), WorksheetFunction.Transpose(aHeader)
            Output2DArray .Cells(1, 2), WorksheetFunction.Transpose(aRows)
            .Columns.AutoFit
        End With
        ' Convert hourly history data to arrays
        JSON.ToArray vJSON("history")("days")(0)("hours"), aRows, aHeader
        ' Output hourly history data to table
        With Sheets(6)
            .Cells.Delete
            OutputArray .Cells(1, 1), aHeader
            Output2DArray .Cells(2, 1), aRows
            .Columns.AutoFit
        End With
        MsgBox "Completed"
    
    End Sub
    
    Sub OutputArray(oDstRng As Range, aCells As Variant)
    
        With oDstRng
            .Parent.Select
            With .Resize( _
                    1, _
                    UBound(aCells) - LBound(aCells) + 1)
                .NumberFormat = "@"
                .Value = aCells
            End With
        End With
    
    End Sub
    
    Sub Output2DArray(oDstRng As Range, aCells As Variant)
    
        With oDstRng
            .Parent.Select
            With .Resize( _
                    UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                    UBound(aCells, 2) - LBound(aCells, 2) + 1)
                .NumberFormat = "@"
                .Value = aCells
            End With
        End With
    
    End Sub
    

    The second XHR returns JSON data, to make it clear how the necessary data is being extracted from it, you may save the JSON to file, copy the contents and paste it to any JSON viewer for further study. I use online tool http://jsonviewer.stack.hu, root element structure is shown below:

    There are 6 main sections, the relevant part of the data is extracted and output to 6 worksheets (which have to be created manually before run):

    Sheet1 - Daily forecast
    Sheet2 - Horly forecast
    Sheet3 - Response data (transposed)
    Sheet4 - Current data (transposed)
    Sheet5 - Astronomy (transposed)
    Sheet6 - Hourly history data
    

    Having that example you can extract the data you need from that JSON response.

    这篇关于使用 VBA 进行网页抓取(当 HTML &lt;&gt; DOM 时)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆