使用 XMLHTTP 使用 vba 进行网页抓取 [英] web scraping with vba using XMLHTTP

查看:123
本文介绍了使用 XMLHTTP 使用 vba 进行网页抓取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从网页 http: 获取一些数据://www.eex.com/en/market-data/power/derivatives-market/phelix-futures.

如果我使用旧的 InternetExplorer 对象(下面的代码),我可以浏览 HTML 文档.但我想使用 XMLHTTP 对象(第二个代码).

子IEZagon()'我们定义了基本变量变暗为对象Dim TDelement, TDelementsDim AnhorLink, AnhorLinks'在您的 VBA 项目中间接添加Microsoft Internet Controls"引用Set ie = CreateObject("InternetExplorer.Application")与即.可见 = 真.navigate ("[URL]http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures[/URL]")而 ie.ReadyState <>4事件温德设置 AnhorLinks = .document.getElementsbytagname("a")设置 TDelements = .document.getElementsbytagname("td")对于 AnhorLinks 中的每个 AnhorLinkDebug.Print AnhorLink.innertext下一个对于 TDelements 中的每个 TDelement调试.打印 TDelement.innertext下一个结束于设置 ie = 无结束子

使用带有 XMLHTTP 对象的代码:

Sub FuturesScrap(ByVal URL As String)将 XMLHttpRequest 淡化为 XMLHTTP将 HTMLDoc 淡化为新的 HTMLDocument设置 XMLHttpRequest = 新建 MSXML2.XMLHTTPXMLHttpRequest.Open "GET", URL, FalseXMLHttpRequest.send而 XMLHttpRequest.readyState <>4事件温德Debug.Print XMLHttpRequest.responseTextHTMLDoc.body.innerHTML = XMLHttpRequest.responseText使用 HTMLDoc.body设置 AnchorLinks = .getElementsByTagName("a")设置 TDelements = .getElementsByTagName("td")对于 AnchorLinks 中的每个 AnchorLinkDebug.Print AnhorLink.innerText下一个对于 TDelements 中的每个 TDelement调试.打印 TDelement.innerText下一个结束于结束子

我只得到基本的 HTML:

<头><title>未找到资源</title><link rel='stylesheet' type='text/css' href='/blueprint/css/errorpage.css'/><身体><table class="header"><tr><td class="CMTitle CMHFill"><span class="large">未找到资源</span></td></tr><div class="body"><p style="font-weight:bold;">请求的资源不存在.</p>

<table class="footer"><tr><td class="CMHFill"></td></tr>

我想遍历表格和相应的数据...最后我想从年到月选择不同的时间间隔:

我真的很感激任何帮助!谢谢!

解决方案

我可以确认,当我运行您的代码(带或不带 url 标记)时,我得到的 HTML 与您相同.我发现了一个有用的帖子 此处.我已经使用在那里找到的方法修改了您的代码,现在它似乎下载了正确的信息.

子测试()调用 FuturesScrap1("http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures")结束子

我包含了调用子程序,因为 url 标记似乎会导致 MSXML 请求出错.

Sub FuturesScrap1(ByVal URL As String)将 HTMLDoc 淡化为新的 HTMLDocumentDim oHttp 作为 MSXML2.XMLHTTPDim sHTML As StringDim AnchorLinks 作为对象Dim TDelements 作为对象Dim TDelement 作为对象Dim AnchorLink 作为对象出错时继续下一步设置 oHttp = 新建 MSXML2.XMLHTTP如果 Err.Number <>0 那么Set oHttp = CreateObject("MSXML.XMLHTTPRequest")MsgBox "创建 MSXML.XMLHTTPRequest 对象时发生错误 0"万一出错时转到 0如果 oHttp 什么都没有,那么MsgBox "由于某种原因,我无法创建 MSXML2.XMLHTTP 对象"退出子万一'在浏览器对象中打开URLoHttp.Open "GET", URL, FalseoHttp.sendsHTML = oHttp.responseTextDebug.Print oHttp.responseTextHTMLDoc.body.innerHTML = oHttp.responseText使用 HTMLDoc.body设置 AnchorLinks = .getElementsByTagName("a")设置 TDelements = .getElementsByTagName("td")对于 AnchorLinks 中的每个 AnchorLinkDebug.Print AnchorLink.innerText下一个对于 TDelements 中的每个 TDelement调试.打印 TDelement.innerText下一个结束于结束子

编辑以下评论:

我无法使用 MSXML2 对象找到表格元素,源代码似乎不包含它们.在 firebug 中存在 td 标签,所以我认为该表是由 JavaScript 代码生成的.我不知道 MSXML2 是否可以运行 JavaScript,所以我修改了 sub 以使用 Internet Explorer,它不是快速代码,但它确实找到了 td 元素并允许单击选项卡.我发现 td 元素可能需要一些时间才能可用(大概是因为 IE 必须运行 JavaScript),所以我做了几个步骤,其中 xl 在下载数据之前等待.

我已经放入了一些代码,可以将 td 元素的内容下载到活动工作表中,如果在包含有用数据的工作簿中运行它,请小心.

Sub FuturesScrap3(ByVal URL As String)将 HTMLDoc 淡化为新的 HTMLDocumentDim AnchorLinks 作为对象Dim tdElements 作为对象Dim tdElement 作为对象Dim AnchorLink 作为对象将 lRow 调暗至长Dim oElement 作为对象Dim oIE 作为 Internet Explorer设置 oIE = 新 InternetExploreroIE.navigate URLoIE.Visible = True直到(oIE.readyState = 4 而不是 oIE.Busy)事件环形'等待Javascript运行Application.Wait (Now + TimeValue("0:01:00"))HTMLDoc.body.innerHTML = oIE.document.body.innerHTML使用 HTMLDoc.body设置 AnchorLinks = .getElementsByTagName("a")设置 tdElements = .getElementsByTagName("td") '对于 AnchorLinks 中的每个 AnchorLinkDebug.Print AnchorLink.innerText下一个锚链接结束于行 = 1对于 tdElements 中的每个 tdElementDebug.Print tdElement.innerTextCells(lRow, 1).Value = tdElement.innerTextlRow = lRow + 1下一个'单击月选项卡对于 oIE.document.all 中的每个 oElementIf Trim(oElement.innerText) = "Month" ThenoElement.FocusoElement.Click万一下一个元素直到(oIE.readyState = 4 而不是 oIE.Busy)事件环形'等待Javascript运行Application.Wait (Now + TimeValue("0:01:00"))HTMLDoc.body.innerHTML = oIE.document.body.innerHTML使用 HTMLDoc.body设置 AnchorLinks = .getElementsByTagName("a")设置 tdElements = .getElementsByTagName("td") '对于 AnchorLinks 中的每个 AnchorLinkDebug.Print AnchorLink.innerText下一个锚链接结束于行 = 1对于 tdElements 中的每个 tdElementDebug.Print tdElement.innerTextCells(lRow, 2).Value = tdElement.innerTextlRow = lRow + 1下一个 tdElement结束子

I would like to get some data from web page http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures.

If I'm using the old InternetExplorer object (code below), I could walking through HTML document. But I would like to use XMLHTTP object (second code).

Sub IEZagon() 
     'we define the essential variables
    Dim ie As Object 
    Dim TDelement, TDelements 
    Dim AnhorLink, AnhorLinks 

     'add the "Microsoft Internet Controls" reference in your VBA Project indirectly
    Set ie = CreateObject("InternetExplorer.Application") 
    With ie 
        .Visible = True 
        .navigate ("[URL]http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures[/URL]") 
        While ie.ReadyState <> 4 
            DoEvents 
        Wend 
        Set AnhorLinks = .document.getElementsbytagname("a") 
        Set TDelements = .document.getElementsbytagname("td") 
        For Each AnhorLink In AnhorLinks 
            Debug.Print AnhorLink.innertext 
        Next 
        For Each TDelement In TDelements 
            Debug.Print TDelement.innertext 
        Next 
    End With 
    Set ie = Nothing 
End Sub

Using code with XMLHTTP object:

Sub FuturesScrap(ByVal URL As String) 
    Dim XMLHttpRequest As XMLHTTP 
    Dim HTMLDoc As New HTMLDocument 

    Set XMLHttpRequest = New MSXML2.XMLHTTP 
    XMLHttpRequest.Open "GET", URL, False 
    XMLHttpRequest.send 
    While XMLHttpRequest.readyState <> 4 
        DoEvents 
    Wend 

    Debug.Print XMLHttpRequest.responseText 
    HTMLDoc.body.innerHTML = XMLHttpRequest.responseText 

    With HTMLDoc.body 
        Set AnchorLinks = .getElementsByTagName("a") 
        Set TDelements = .getElementsByTagName("td") 

        For Each AnchorLink In AnchorLinks 
            Debug.Print AnhorLink.innerText 
        Next 

        For Each TDelement In TDelements 
            Debug.Print TDelement.innerText 
        Next 
    End With 
End Sub 

I get only basic HTML:

<html> 
<head> 
<title>Resource Not found</title> 
<link rel= 'stylesheet' type='text/css' href='/blueprint/css/errorpage.css'/>
</head> 
<body> 
<table class="header"> 
<tr> 
<td class="CMTitle CMHFill"><span class="large">Resource Not found</span></td> 
</tr> 
</table> 
<div class="body"> 
<p style="font-weight:bold;">The requested resource does Not exist.</p> 
</div> 
<table class="footer"> 
<tr> 
<td class="CMHFill"> </td> 
</tr> 
</table> 
</body> 
</html>

I would like to walking through tables and coresponding data... And finally I would like to select diferent time interval from Year to Month:

I'd really appreciate any help! Thank you!

解决方案

I can confirm that I get the same HTML as you when I run your code (with or without the url tags). I found a useful post here. I have modified your code using the method found there and it now appears to have downloaded the correct information.

Sub test()
    Call FuturesScrap1("http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures")
End Sub

I included the calling sub because the url tags appeared to cause an error for the MSXML request.

Sub FuturesScrap1(ByVal URL As String)
    Dim HTMLDoc As New HTMLDocument
    Dim oHttp As MSXML2.XMLHTTP
    Dim sHTML As String
    Dim AnchorLinks As Object
    Dim TDelements As Object
    Dim TDelement As Object
    Dim AnchorLink As Object

    On Error Resume Next
    Set oHttp = New MSXML2.XMLHTTP
    If Err.Number <> 0 Then
        Set oHttp = CreateObject("MSXML.XMLHTTPRequest")
        MsgBox "Error 0 has occured while creating a MSXML.XMLHTTPRequest object"
    End If
    On Error GoTo 0
    If oHttp Is Nothing Then
        MsgBox "For some reason I wasn't able to make a MSXML2.XMLHTTP object"
        Exit Sub
    End If

    'Open the URL in browser object
    oHttp.Open "GET", URL, False
    oHttp.send
    sHTML = oHttp.responseText

    Debug.Print oHttp.responseText

    HTMLDoc.body.innerHTML = oHttp.responseText

    With HTMLDoc.body
        Set AnchorLinks = .getElementsByTagName("a")
        Set TDelements = .getElementsByTagName("td")

        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next

        For Each TDelement In TDelements
            Debug.Print TDelement.innerText
        Next
    End With

End Sub

Edit folowing comment:

I haven't been able to find the table elements using MSXML2 object, the source code doesn't appear to contain them. In firebug the td tags are present so I thik that the table is generated by the JavaScript code. I don't know if MSXML2 can run the JavaScript so I've modified the sub to use internet explorer, it's not quick code, but it does find the td elements and does allow clicking the tabs. I have found that the td elements can take some time to become available (presumably for IE has to run the JavaScript) so I have put in a couple of steps where xl waits before downloading the data.

I have put in some code that will download the contents of the td elements into the active worksheet, be careful if running it in a workbook with useful data in it.

Sub FuturesScrap3(ByVal URL As String)

    Dim HTMLDoc As New HTMLDocument
    Dim AnchorLinks As Object
    Dim tdElements As Object
    Dim tdElement As Object
    Dim AnchorLink As Object
    Dim lRow As Long
    Dim oElement As Object

    Dim oIE As InternetExplorer

    Set oIE = New InternetExplorer

    oIE.navigate URL
    oIE.Visible = True

    Do Until (oIE.readyState = 4 And Not oIE.Busy)
        DoEvents
    Loop

    'Wait for Javascript to run
    Application.Wait (Now + TimeValue("0:01:00"))

    HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

    With HTMLDoc.body
        Set AnchorLinks = .getElementsByTagName("a")
        Set tdElements = .getElementsByTagName("td") '

        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next AnchorLink

    End With

    lRow = 1
    For Each tdElement In tdElements
        Debug.Print tdElement.innerText
        Cells(lRow, 1).Value = tdElement.innerText
        lRow = lRow + 1
    Next

    'Clicking the Month tab
    For Each oElement In oIE.document.all
        If Trim(oElement.innerText) = "Month" Then
            oElement.Focus
            oElement.Click
        End If
    Next oElement

    Do Until (oIE.readyState = 4 And Not oIE.Busy)
        DoEvents
    Loop

    'Wait for Javascript to run
    Application.Wait (Now + TimeValue("0:01:00"))

    HTMLDoc.body.innerHTML = oIE.document.body.innerHTML

    With HTMLDoc.body
        Set AnchorLinks = .getElementsByTagName("a")
        Set tdElements = .getElementsByTagName("td") '

        For Each AnchorLink In AnchorLinks
            Debug.Print AnchorLink.innerText
        Next AnchorLink
    End With

    lRow = 1
    For Each tdElement In tdElements
        Debug.Print tdElement.innerText
        Cells(lRow, 2).Value = tdElement.innerText
        lRow = lRow + 1
    Next tdElement

End sub

这篇关于使用 XMLHTTP 使用 vba 进行网页抓取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆