Website data table scraper
Problem description
Before I ask my question: I'm an amateur coder with basically no meaningful experience beyond VBA in MS Office apps (I know, noob!).
I'm trying to create a web scraper using VBA to import data into Excel. As per the comments in the code extract below, the best I've been able to find on this was in the winning answer to this question.
Below, I'm using investing.com as an example, but in reality my project will span multiple sites and will feed into a matrix that will update daily and self-cannibalize as events expire. For this reason I'd rather front-load the work on the code side, so that the ongoing inputs (from me) are as minimal as possible.
With that in mind, can I ask if there's a way to do any of the following (brace yourself, this will be cringe-worthy basic knowledge for some):
- Is there a way to navigate to a URL and run a For Each loop over every table on that page (without having a known id for any of them)? This is as much to speed up my code as to minimise my inputs, as there'll be quite a bit of data to update and I was planning on putting a 2-minute looping trigger on the refresh.
- Instead of doing what I've been doing below, is it possible to reference a table, rather than a row, and do something along the lines of Cells(2,5).value to return the value within row 1, column 4 (assuming that the array indexing starts at 0 in both dimensions)? Further to that, my first column (my primary key, in some ways) may not be in the same order on all sources, so is there a way to do the equivalent of
  Columns("A:A").Find(What:=[Primary key], After:=Cells(1, 1), LookIn:=xlValues, LookAt:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, MatchCase:=True, SearchFormat:=False).Row
  to find which row within the table relates to the event I'm looking for?
Code:
Sub Scraper()
    Dim appIE, allRowOfData As Object
    ' As per https://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba
    Set appIE = CreateObject("internetexplorer.application")
    With appIE
        .Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
        .Visible = False
    End With
    Do While appIE.Busy
        Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second before trying again
    Loop
    Set allRowOfData = appIE.document.getElementById("pair_8907")
    'tr id="[ID of row within table]"
    Dim myValue As String: myValue = allRowOfData.Cells(8).innerHTML
    'The 8 is the column number of the table
    '(note: column numbers start at 0, so the 9th column should have "8" entered here)
    Set appIE = Nothing
    Range("A1").Value = myValue
End Sub
If you want to use Excel functions to navigate the tables, why not dump the tables onto a worksheet first? This code works for me:
Option Explicit
Sub Scraper()
    Dim appIE As Object
    ' As per http://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba
    Set appIE = CreateObject("internetexplorer.application")
    With appIE
        .Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
        .Visible = True
    End With
    Do While appIE.Busy
        DoEvents
        Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second before trying again
    Loop
    'Debug.Print TypeName(appIE.document)
    Dim doc As Object 'MSHTML.HTMLDocument
    Set doc = appIE.document
    '* appIE.Busy is good, but you need to wait for the whole document to completely load and initialise, so use this
    While doc.readyState <> "complete"
        DoEvents
    Wend
    '* we can select all the tables because they share the same CSS class name
    Dim tablesSelectedByClass As Object 'MSHTML.HTMLElementCollection
    Set tablesSelectedByClass = doc.getElementsByClassName("genTbl")
    '* you can change this, it was just convenient for me to add sheets to my workbook
    Dim shNewResults As Excel.Worksheet
    Set shNewResults = ThisWorkbook.Worksheets.Add
    Dim lRowCursor As Long '* this controls pasting down the sheet
    lRowCursor = 1
    Dim lTableIndexLoop As Long
    For lTableIndexLoop = 0 To tablesSelectedByClass.Length - 1
        Dim tableLoop As Object 'MSHTML.HTMLTable
        Set tableLoop = tablesSelectedByClass.Item(lTableIndexLoop)
        If LenB(tableLoop.ID) > 0 Then '* there are some extra nonsense tables, this subselects
            Dim sParentColumn As String, objParentColumn As Object ' MSHTML.HTMLSemanticElement
            Set objParentColumn = FindMyColumn(tableLoop, sParentColumn) '* need to know whether the table is on the left-hand or right-hand side
            Dim vHeader As Variant: vHeader = Empty
            If sParentColumn = "leftColumn" Then
                '* tables on the left have a preceding H3 element with the table's description
                Dim objH3Headers As Object
                Set objH3Headers = objParentColumn.getElementsByTagName("H3")
                vHeader = objH3Headers.Item(lTableIndexLoop).innerText
            Else
                '* tables on the right have a hidden attribute we can use
                vHeader = tableLoop.Attributes.Item("data-gae").Value
                If Len(vHeader) > 3 Then
                    vHeader = Mid$(vHeader, 4)
                    Mid$(vHeader, 1, 1) = Chr(Asc(Mid$(vHeader, 1, 1)) - 32)
                End If
            End If
            '* tables on the right do not have column headers
            Dim bHasColumnHeaders As Boolean
            bHasColumnHeaders = (tableLoop.ChildNodes.Length = 2)
            Dim vTableCells() As Variant '* this will be our table data container, which we will paste in one go
            Dim lRowCount As Long: lRowCount = 0
            Dim lColumnCount As Long: lColumnCount = 0
            Dim lDataHeadersSectionIdx As Long: lDataHeadersSectionIdx = 0
            Dim objColumnHeaders As Object: Set objColumnHeaders = Nothing
            If bHasColumnHeaders Then
                Set objColumnHeaders = tableLoop.ChildNodes.Item(0).ChildNodes.Item(0)
                lRowCount = lRowCount + 1
                lDataHeadersSectionIdx = 1
            Else
                lDataHeadersSectionIdx = 0
            End If
            Dim objDataRows As Object 'MSHTML.HTMLElementCollection
            Set objDataRows = tableLoop.ChildNodes.Item(lDataHeadersSectionIdx).ChildNodes
            lColumnCount = objDataRows.Item(0).ChildNodes.Length
            lRowCount = lRowCount + objDataRows.Length
            ReDim vTableCells(1 To lRowCount, 1 To lColumnCount) As Variant
            '* if we have them, get the column headers
            Dim lColLoop As Long
            If bHasColumnHeaders Then
                For lColLoop = 1 To lColumnCount
                    vTableCells(1, lColLoop) = objColumnHeaders.ChildNodes.Item(lColLoop - 1).innerText
                Next
            End If
            '* get the data cells
            Dim lRowLoop As Long
            For lRowLoop = 1 To lRowCount - VBA.IIf(bHasColumnHeaders, 1, 0)
                For lColLoop = 1 To lColumnCount
                    vTableCells(lRowLoop + VBA.IIf(bHasColumnHeaders, 1, 0), lColLoop) = objDataRows.Item(lRowLoop - 1).ChildNodes.Item(lColLoop - 1).innerText
                Next
            Next
            '* paste our table description
            shNewResults.Cells(lRowCursor, 1).Value2 = vHeader
            lRowCursor = lRowCursor + 1
            '* paste our table data
            shNewResults.Cells(lRowCursor, 1).Resize(lRowCount, lColumnCount).Value2 = vTableCells
            lRowCursor = lRowCursor + lRowCount + 1
        End If
    Next
End Sub
Function FindMyColumn(ByVal node As Object, ByRef psColumn As String) As Object
    '* this code ascends the DOM looking for "column" in the id of each node
    While InStr(1, node.ID, "column", vbTextCompare) = 0 And Not node.ParentNode Is Nothing
        DoEvents
        Set node = node.ParentNode
    Wend
    If InStr(1, node.ID, "column", vbTextCompare) > 0 Then
        Set FindMyColumn = node
        psColumn = CStr(node.ID)
    End If
End Function
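The two points in the question (looping over every table without knowing its id, and addressing a cell by row and column or finding a row by key) can also be sketched directly. The following is a minimal, untested sketch rather than part of the code above: ListTables, TableCellText and FindKeyRow are hypothetical helper names, and it assumes the page uses ordinary <table> elements reachable through getElementsByTagName and that the primary key sits in column A of the dumped sheet.

```vba
Option Explicit

' Hypothetical helper: visits every <table> on the page, no id required.
Sub ListTables(ByVal doc As Object)
    Dim tbl As Object
    For Each tbl In doc.getElementsByTagName("table")
        Debug.Print "id='" & tbl.ID & "', rows=" & tbl.Rows.Length
    Next tbl
End Sub

' Hypothetical helper: text of one cell in the Nth table on the page.
' All three indexes are 0-based, like the DOM collections themselves.
' Returns an empty string if any index is out of range.
Function TableCellText(ByVal doc As Object, ByVal tableIdx As Long, _
                       ByVal rowIdx As Long, ByVal colIdx As Long) As String
    Dim tables As Object
    Set tables = doc.getElementsByTagName("table")
    If tableIdx < 0 Or tableIdx >= tables.Length Then Exit Function

    Dim tbl As Object
    Set tbl = tables.Item(tableIdx)
    If rowIdx < 0 Or rowIdx >= tbl.Rows.Length Then Exit Function

    Dim tr As Object
    Set tr = tbl.Rows.Item(rowIdx)
    If colIdx < 0 Or colIdx >= tr.Cells.Length Then Exit Function

    TableCellText = tr.Cells.Item(colIdx).innerText
End Function

' Hypothetical helper: worksheet row whose column A exactly matches the key.
' Returns 0 when the key is absent, instead of failing with error 91.
Function FindKeyRow(ByVal ws As Worksheet, ByVal key As String) As Long
    Dim hit As Range
    Set hit = ws.Columns(1).Find(What:=key, After:=ws.Cells(1, 1), _
                                 LookIn:=xlValues, LookAt:=xlWhole, _
                                 SearchOrder:=xlByRows, SearchDirection:=xlNext, _
                                 MatchCase:=True)
    If Not hit Is Nothing Then FindKeyRow = hit.Row
End Function
```

For example, once the document has finished loading, Debug.Print TableCellText(appIE.document, 0, 1, 4) would print the fifth cell of the second row of the first table, and FindKeyRow(shNewResults, "some key") would locate a key on the dumped sheet. Checking for Nothing before reading .Row matters because Find returns Nothing when there is no match.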
By the way, if you trade a lot, the brokers get rich and you get poor; brokerage charges really add up in the long run.