使用XMLHTTP使用vba进行网页抓取 [英] web scraping with vba using XMLHTTP
问题描述
我想从网页 http: //www.eex.com/en/market-data/power/derivatives-market/phelix-futures 。
如果我使用旧的InternetExplorer对象(下面的代码),我可以走过HTML文档。但我想用 XMLHTTP
对象(第二个代码)。
Sub IEZagon()
'我们定义了基本变量
Dim ie As Object
Dim TDelement,TDelements
Dim AnhorLink,AnhorLinks
'add the微软互联网控制引用在您的VBA项目间接
Set ie = CreateObject(InternetExplorer.Application)
使用ie
.Visible = True
.navigate([URL] http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures[/URL])
虽然ie.ReadyState<> 4
DoEvents
Wend
Set AnhorLinks = .document.getElementsbytagname(a)
Set TDelements = .document.getElementsbytagname(td)
对于每个AnhorLink在AnhorLinks
Debug.Print AnhorLink.innertext
下一个
对于TDelements中的每个TDelement
Debug.Print TDelement.innertext
下一个
End With
设置ie = Nothing
End Sub
在XMLHTTP对象中使用代码:
Sub FuturesScrap(ByVal URL As String)
Dim XMLHttpRequest As XMLHTTP
Dim HTMLDoc As New HTMLDocument
Set XMLHttpRequest = New MSXML2.XMLHTTP
XMLHttpRequest.OpenGET,URL,False
XMLHttpRequest.send
虽然XMLHttpRequest.readyState<> 4
DoEvents
Wend
Debug.Print XMLHttpRequest.responseText
HTMLDoc.body.innerHTML = XMLHttpRequest.responseText
使用HTMLDoc.body
Set AnchorLinks = .getElementsByTagName(a)
Set TDelements = .getElementsByTagName(td)
对于AnchorLinks中的每个AnchorLink
Debug.Print AnhorLink。 innerText
Next
对于TDelements中的每个TDelement
Debug.Print TDelement.innerText
Next
End With
End Sub
我只得到基本的HTML:
< HTML>
< head>
< title>资源未找到< / title>
< link rel ='stylesheet'type ='text / css'href ='/ blueprint / css / errorpage.css'/>
< / head>
< body>
< table class =header>
< tr>
< td class =CMTitle CMHFill>< span class =large>资源未找到< / span>< / td>
< / tr>
< / table>
< div class =body>
< p style =font-weight:bold;>请求的资源不存在。< / p>
< / div>
< table class =footer>
< tr>
< td class =CMHFill> < / TD>
< / tr>
< / table>
< / body>
< / html>
我想通过表格和相应的数据进行步骤...
最后我会喜欢从年到月份选择不同的时间间隔:
我非常感谢任何帮助!谢谢!
当我运行代码(带或不带url标记)时,我可以确认获得相同的HTML。我找到了一个有用的帖子此处。我已经使用在那里找到的方法修改了你的代码,它现在似乎已经下载了正确的信息。
Sub test()
致电FuturesScrap1(http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures)
结束Sub
我包含调用子,因为url标记似乎会导致MSXML请求出错。
Sub FuturesScrap1(ByVal URL As String)
Dim HTMLDoc As New HTMLDocument
Dim oHttp As MSXML2.XMLHTTP
Dim sHTML As String
Dim AnchorLinks As Object
Dim TDelements As Object
Dim TDelement As Object
Dim AnchorLink As Object
On Error Resume Next
Set oHttp = New MSXML2 .XMLHTTP
如果Err.Number<> 0然后
Set oHttp = CreateObject(MSXML.XMLHTTPRequest)
MsgBox创建MSXML.XMLHTTPRequest对象时发生错误0
End If
On Error GoTo 0
如果oHttp没有那么
MsgBox出于某种原因,我无法创建一个MSXML2.XMLHTTP对象
Exit Sub
End If
'在浏览器对象
中打开URL oHttp.OpenGET,URL,False
oHttp.send
sHTML = oHttp.responseText
Debug.Print oHttp。 responseText
HTMLDoc.body.innerHTML = oHttp.responseText
使用HTMLDoc.body
设置AnchorLinks = .getElementsByTagName(a)
设置TDelements = .getElementsByTagName(td)
对于AnchorLink中的每个AnchorLink
Debug.Print AnchorLink.innerText
下一个
对于TDelements中的每个TDelement
Debug.Print TDelement.innerText
下一个
End With
End Sub
编辑folowing评论:
我无法使用MSXML2对象查找表格元素,但源代码似乎并未包含它们。在firebug中,td标签是存在的,所以我认为表是由JavaScript代码生成的。我不知道MSXML2是否可以运行JavaScript,因此我修改了使用Internet Explorer的子程序,它不是快速代码,但它确实找到了td元素,并且确实允许单击这些标签。我发现td元素需要一些时间才能变得可用(大概是因为IE需要运行JavaScript),所以我已经在xl下载数据之前等待了几个步骤。
我已经放入了一些代码,将td元素的内容下载到活动工作表中,小心如果在工作簿中运行它,它。
Sub FuturesScrap3(ByVal URL As String)
Dim HTMLDoc As New HTMLDocument
Dim AnchorLinks As Object
Dim tdElements As Object
Dim tdElement As Object
Dim AnchorLink As Object
Dim lRow As Long
Dim oElement As Object
oIE InternetExplorer
设置oIE =新InternetExplorer
oIE.navigate URL
oIE.Visible = True
Do Until(oIE.readyState = 4而不是oIE.Busy)
DoEvents
Loop
'等待Javascript运行
Application.Wait(Now + TimeValue(0:01:00))
HTMLDoc.body.innerHTML = oIE.document.body.innerHTML
使用HTMLDoc.body
Set AnchorLinks = .getElementsB yTagName(a)
Set tdElements = .getElementsByTagName(td)'
对于AnchorLink中的每个AnchorLink
Debug.Print AnchorLink.innerText
下一个AnchorLink
End With
lRow = 1
对于tdElements中的每个tdElement
Debug.Print tdElement.innerText
单元格(lRow,1) .Value = tdElement.innerText
lRow = lRow + 1
下一个
'单击月份选项卡
对于每个oElement oIE.document.all
如果修剪(oElement.innerText)=月然后
oElement.Focus
oElement.Click
End If
Next oElement
Do Until(oIE .readyState = 4而不是oIE.Busy)
DoEvents
Loop
'等待Javascript运行
Application.Wait(Now + TimeValue(0:01 :00))
HTMLDoc.body.innerHTML = oIE.document.body.innerHTML
使用HTMLDoc.body
Set AnchorLinks = .getElementsByTagName(a)
Set tdElements = .getElementsByTagName(td)'
对于AnchorLinks中的每个AnchorLink
Debug.Print AnchorLink.innerText
Next AnchorLink
End With
lRow = 1
对于tdElements中的每个tdElement
Debug.Print tdElement.innerText
单元格(lRow,2) .Value = tdElement.innerText
lRow = lRow + 1
Next tdElement
End sub
I would like to get some data from web page http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures.
If I'm using the old InternetExplorer object (code below), I could walking through HTML document. But I would like to use XMLHTTP
object (second code).
Sub IEZagon()
'we define the essential variables
Dim ie As Object
Dim TDelement, TDelements
Dim AnhorLink, AnhorLinks
'add the "Microsoft Internet Controls" reference in your VBA Project indirectly
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.navigate ("[URL]http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures[/URL]")
While ie.ReadyState <> 4
DoEvents
Wend
Set AnhorLinks = .document.getElementsbytagname("a")
Set TDelements = .document.getElementsbytagname("td")
For Each AnhorLink In AnhorLinks
Debug.Print AnhorLink.innertext
Next
For Each TDelement In TDelements
Debug.Print TDelement.innertext
Next
End With
Set ie = Nothing
End Sub
Using code with XMLHTTP object:
Sub FuturesScrap(ByVal URL As String)
Dim XMLHttpRequest As XMLHTTP
Dim HTMLDoc As New HTMLDocument
Set XMLHttpRequest = New MSXML2.XMLHTTP
XMLHttpRequest.Open "GET", URL, False
XMLHttpRequest.send
While XMLHttpRequest.readyState <> 4
DoEvents
Wend
Debug.Print XMLHttpRequest.responseText
HTMLDoc.body.innerHTML = XMLHttpRequest.responseText
With HTMLDoc.body
Set AnchorLinks = .getElementsByTagName("a")
Set TDelements = .getElementsByTagName("td")
For Each AnchorLink In AnchorLinks
Debug.Print AnhorLink.innerText
Next
For Each TDelement In TDelements
Debug.Print TDelement.innerText
Next
End With
End Sub
I get only basic HTML:
<html>
<head>
<title>Resource Not found</title>
<link rel= 'stylesheet' type='text/css' href='/blueprint/css/errorpage.css'/>
</head>
<body>
<table class="header">
<tr>
<td class="CMTitle CMHFill"><span class="large">Resource Not found</span></td>
</tr>
</table>
<div class="body">
<p style="font-weight:bold;">The requested resource does Not exist.</p>
</div>
<table class="footer">
<tr>
<td class="CMHFill"> </td>
</tr>
</table>
</body>
</html>
I would like to walking through tables and coresponding data... And finally I would like to select diferent time interval from Year to Month:
I'd really appreciate any help! Thank you!
I can confirm that I get the same HTML as you when I run your code (with or without the url tags). I found a useful post here. I have modified your code using the method found there and it now appears to have downloaded the correct information.
Sub test()
Call FuturesScrap1("http://www.eex.com/en/market-data/power/derivatives-market/phelix-futures")
End Sub
I included the calling sub because the url tags appeared to cause an error for the MSXML request.
Sub FuturesScrap1(ByVal URL As String)
Dim HTMLDoc As New HTMLDocument
Dim oHttp As MSXML2.XMLHTTP
Dim sHTML As String
Dim AnchorLinks As Object
Dim TDelements As Object
Dim TDelement As Object
Dim AnchorLink As Object
On Error Resume Next
Set oHttp = New MSXML2.XMLHTTP
If Err.Number <> 0 Then
Set oHttp = CreateObject("MSXML.XMLHTTPRequest")
MsgBox "Error 0 has occured while creating a MSXML.XMLHTTPRequest object"
End If
On Error GoTo 0
If oHttp Is Nothing Then
MsgBox "For some reason I wasn't able to make a MSXML2.XMLHTTP object"
Exit Sub
End If
'Open the URL in browser object
oHttp.Open "GET", URL, False
oHttp.send
sHTML = oHttp.responseText
Debug.Print oHttp.responseText
HTMLDoc.body.innerHTML = oHttp.responseText
With HTMLDoc.body
Set AnchorLinks = .getElementsByTagName("a")
Set TDelements = .getElementsByTagName("td")
For Each AnchorLink In AnchorLinks
Debug.Print AnchorLink.innerText
Next
For Each TDelement In TDelements
Debug.Print TDelement.innerText
Next
End With
End Sub
Edit folowing comment:
I haven't been able to find the table elements using MSXML2 object, the source code doesn't appear to contain them. In firebug the td tags are present so I thik that the table is generated by the JavaScript code. I don't know if MSXML2 can run the JavaScript so I've modified the sub to use internet explorer, it's not quick code, but it does find the td elements and does allow clicking the tabs. I have found that the td elements can take some time to become available (presumably for IE has to run the JavaScript) so I have put in a couple of steps where xl waits before downloading the data.
I have put in some code that will download the contents of the td elements into the active worksheet, be careful if running it in a workbook with useful data in it.
Sub FuturesScrap3(ByVal URL As String)
Dim HTMLDoc As New HTMLDocument
Dim AnchorLinks As Object
Dim tdElements As Object
Dim tdElement As Object
Dim AnchorLink As Object
Dim lRow As Long
Dim oElement As Object
Dim oIE As InternetExplorer
Set oIE = New InternetExplorer
oIE.navigate URL
oIE.Visible = True
Do Until (oIE.readyState = 4 And Not oIE.Busy)
DoEvents
Loop
'Wait for Javascript to run
Application.Wait (Now + TimeValue("0:01:00"))
HTMLDoc.body.innerHTML = oIE.document.body.innerHTML
With HTMLDoc.body
Set AnchorLinks = .getElementsByTagName("a")
Set tdElements = .getElementsByTagName("td") '
For Each AnchorLink In AnchorLinks
Debug.Print AnchorLink.innerText
Next AnchorLink
End With
lRow = 1
For Each tdElement In tdElements
Debug.Print tdElement.innerText
Cells(lRow, 1).Value = tdElement.innerText
lRow = lRow + 1
Next
'Clicking the Month tab
For Each oElement In oIE.document.all
If Trim(oElement.innerText) = "Month" Then
oElement.Focus
oElement.Click
End If
Next oElement
Do Until (oIE.readyState = 4 And Not oIE.Busy)
DoEvents
Loop
'Wait for Javascript to run
Application.Wait (Now + TimeValue("0:01:00"))
HTMLDoc.body.innerHTML = oIE.document.body.innerHTML
With HTMLDoc.body
Set AnchorLinks = .getElementsByTagName("a")
Set tdElements = .getElementsByTagName("td") '
For Each AnchorLink In AnchorLinks
Debug.Print AnchorLink.innerText
Next AnchorLink
End With
lRow = 1
For Each tdElement In tdElements
Debug.Print tdElement.innerText
Cells(lRow, 2).Value = tdElement.innerText
lRow = lRow + 1
Next tdElement
End sub
这篇关于使用XMLHTTP使用vba进行网页抓取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!