Website data table scraper


Question


Before I ask my question: I'm an amateur coder with basically no meaningful experience beyond VBA in MS Office apps (I know, noob!).

I'm trying to create a web scraper using VBA to import data into Excel. As per the comments in the code extract below, the best I've been able to find on this was the winning answer to this question.

Below, I'm using investing.com as an example, but in reality my project will span multiple sites and will feed into a matrix that updates daily and self-cannibalizes as events expire. For this reason I'd rather front-load the work on the code side so that the ongoing inputs are as minimal as possible (for me).

With that in mind, can I ask if there's a way to do any of the following (brace yourself, this will be cringe-worthy basic knowledge for some):

  1. Is there a way to navigate to a URL and run a For Each loop over every table on that page (without having a known id for any of them)? This is as much to speed up my code as to minimise my inputs, since there'll be quite a bit of data to update and I was planning on putting a 2-minute looping trigger on the refresh.

  2. Instead of doing what I've been doing below, is it possible to reference a table, rather than a row, and do something along the lines of Cells(2,5).value to return the value within row 1, column 4 (assuming that the array indexing starts at 0 in both dimensions)? Further to that, my first column (my primary key in some ways) may not be in the same order on all sources, so is there a way to do the equivalent of Columns("A:A").Find(What:=[Primary key], After:=Cells(1, 1), LookIn:=xlValues, LookAt:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, MatchCase:=True, SearchFormat:=False).Row to find which row within the table relates to the event I'm looking for?

Code:

Sub Scraper()
    ' Declare each variable's type explicitly; "Dim a, b As Object" leaves a as Variant
    Dim appIE As Object, allRowOfData As Object

    ' As per https://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba

    Set appIE = CreateObject("internetexplorer.application")

    With appIE
        .Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
        .Visible = False
    End With

    Do While appIE.Busy
        Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second before trying again
    Loop

    Set allRowOfData = appIE.document.getElementById("pair_8907")
    'tr id="[ID of row within table]"
    Dim myValue As String: myValue = allRowOfData.Cells(8).innerHTML
    'The 8 is the column number of the table
    '(note: column numbers start at 0, so the 9th column should have "8" entered here)

    Set appIE = Nothing

    Range("A1").Value = myValue

End Sub

Solution

If you want to use Excel functions to navigate the tables, why not dump the tables onto a worksheet first? This code works for me:

Option Explicit

Sub Scraper()
    Dim appIE As Object

    ' As per http://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba

    Set appIE = CreateObject("internetexplorer.application")

    With appIE
       .Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
       .Visible = True
    End With

    Do While appIE.Busy
        DoEvents
        Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second before trying again
    Loop

    'Debug.Print TypeName(appIE.document)

    Dim doc As Object 'MSHTML.HTMLDocument
    Set doc = appIE.document

    '* checking appIE.Busy is good, but you also need to wait for the whole document to load and initialise completely, so use this
    While doc.readyState <> "complete"
        DoEvents
    Wend

    '* we can select all the tables because they share the same CSS class name
    Dim tablesSelectedByClass As Object 'MSHTML.HTMLElementCollection
    Set tablesSelectedByClass = doc.getElementsByClassName("genTbl")

    '* you can change this, it was just convenient for me to add sheets to my workbook
    Dim shNewResults As Excel.Worksheet
    Set shNewResults = ThisWorkbook.Worksheets.Add

    Dim lRowCursor As Long  '* this controls pasting down the sheet
    lRowCursor = 1

    Dim lTableIndexLoop As Long
    For lTableIndexLoop = 0 To tablesSelectedByClass.Length - 1

        Dim tableLoop As Object 'MSHTML.HTMLTable
        Set tableLoop = tablesSelectedByClass.Item(lTableIndexLoop)

        If LenB(tableLoop.ID) > 0 Then  '* there are some extra nonsense tables, this subselects

            Dim sParentColumn As String, objParentColumn As Object ' MSHTML.HTMLSemanticElement
            Set objParentColumn = FindMyColumn(tableLoop, sParentColumn) '* need to know whether the table is on the left-hand or right-hand side

            Dim vHeader As Variant: vHeader = Empty
            If sParentColumn = "leftColumn" Then
                '* tables on the left have a preceding H3 element with the table's description
                Dim objH3Headers As Object
                Set objH3Headers = objParentColumn.getElementsByTagName("H3")
                vHeader = objH3Headers.Item(lTableIndexLoop).innerText
            Else
                '* tables on the right have a hidden attribute we can use
                vHeader = tableLoop.Attributes.Item("data-gae").Value
                If Len(vHeader) > 3 Then
                    vHeader = Mid$(vHeader, 4)
                    Mid$(vHeader, 1, 1) = Chr(Asc(Mid$(vHeader, 1, 1)) - 32)
                End If
            End If

            '* tables on the right do not have column headers
            Dim bHasColumnHeaders As Boolean
            bHasColumnHeaders = (tableLoop.ChildNodes.Length = 2)

            Dim vTableCells() As Variant   '* this will be our table data container which we will paste in one go
            Dim lRowCount As Long: lRowCount = 0
            Dim lColumnCount As Long: lColumnCount = 0
            Dim lDataHeadersSectionIdx As Long: lDataHeadersSectionIdx = 0
            Dim objColumnHeaders As Object: Set objColumnHeaders = Nothing

            If bHasColumnHeaders Then

                Set objColumnHeaders = tableLoop.ChildNodes.Item(0).ChildNodes.Item(0)

                lRowCount = lRowCount + 1

                lDataHeadersSectionIdx = 1
            Else
                lDataHeadersSectionIdx = 0
            End If

            Dim objDataRows As Object 'MSHTML.HTMLElementCollection
            Set objDataRows = tableLoop.ChildNodes.Item(lDataHeadersSectionIdx).ChildNodes
            lColumnCount = objDataRows.Item(0).ChildNodes.Length

            lRowCount = lRowCount + objDataRows.Length

            ReDim vTableCells(1 To lRowCount, 1 To lColumnCount) As Variant

            '* if we have them, get the column headers
            Dim lColLoop As Long
            If bHasColumnHeaders Then
                For lColLoop = 1 To lColumnCount
                    vTableCells(1, lColLoop) = objColumnHeaders.ChildNodes.Item(lColLoop - 1).innerText
                Next
            End If

            '* get the data cells
            Dim lRowLoop As Long
            For lRowLoop = 1 To lRowCount - VBA.IIf(bHasColumnHeaders, 1, 0)
                For lColLoop = 1 To lColumnCount
                    vTableCells(lRowLoop + VBA.IIf(bHasColumnHeaders, 1, 0), lColLoop) = objDataRows.Item(lRowLoop - 1).ChildNodes.Item(lColLoop - 1).innerText
                Next
            Next

            '* paste our table description
            shNewResults.Cells(lRowCursor, 1).Value2 = vHeader
            lRowCursor = lRowCursor + 1

            '* paste our table data
            shNewResults.Cells(lRowCursor, 1).Resize(lRowCount, lColumnCount).Value2 = vTableCells
            lRowCursor = lRowCursor + lRowCount + 1
        End If

    Next

End Sub

Function FindMyColumn(ByVal node As Object, ByRef psColumn As String) As Object
    '* this code ascends the DOM looking for "column" in the id of each node
    While InStr(1, node.ID, "column", vbTextCompare) = 0 And Not node.ParentNode Is Nothing
        DoEvents
        Set node = node.ParentNode
    Wend
    If InStr(1, node.ID, "column", vbTextCompare) > 0 Then
        Set FindMyColumn = node
        psColumn = CStr(node.ID)
    End If


End Function
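
On the first question (looping over every table without knowing any IDs), here is a minimal sketch rather than a tested drop-in. It assumes `doc` is the fully loaded document from the code above; `DumpAllTables` and its parameter names are made up for illustration. It relies on the `rows` and `cells` collections that HTML table elements expose, both zero-based, which is the same `Cells(n)` access the question already uses on a single row.

```vba
' Hypothetical helper: copies every <table> in a loaded document onto a
' worksheet, one table below another, without needing any element IDs.
Sub DumpAllTables(ByVal doc As Object, ByVal ws As Worksheet)
    Dim tbl As Object
    Dim r As Long, c As Long
    Dim nextRow As Long: nextRow = 1

    For Each tbl In doc.getElementsByTagName("table")
        For r = 0 To tbl.Rows.Length - 1
            For c = 0 To tbl.Rows(r).Cells.Length - 1
                ' DOM rows and cells count from 0; worksheet cells count from 1
                ws.Cells(nextRow + r, c + 1).Value2 = tbl.Rows(r).Cells(c).innerText
            Next c
        Next r
        nextRow = nextRow + tbl.Rows.Length + 1   ' leave a blank row between tables
    Next tbl
End Sub
```

It could be called as `DumpAllTables doc, ThisWorkbook.Worksheets.Add` once the `readyState` wait has finished. The same collections give table-level access to a single value, for example `tbl.Rows(1).Cells(4).innerText`. Two caveats: nested tables are returned by `getElementsByTagName` as well, so their content would be written twice, and writing cell by cell is slower than filling an array and pasting it in one go as the code above does.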

By the way, if you trade a lot, the brokers get rich and you get poor; brokerage charges really add up in the long run.
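
For the lookup part of the second question, once the tables are on a worksheet the same `Find` call from the question can be wrapped in a small function. This is a sketch under assumptions: `FindKeyRow` is a made-up name, and it presumes the key sits in column A of the sheet the tables were pasted onto.

```vba
' Hypothetical helper: returns the worksheet row whose column-A value equals
' sKey exactly (case-sensitive), or 0 when the key is not on the sheet.
Function FindKeyRow(ByVal ws As Worksheet, ByVal sKey As String) As Long
    Dim rngHit As Range
    Set rngHit = ws.Columns(1).Find(What:=sKey, LookIn:=xlValues, _
                                    LookAt:=xlWhole, MatchCase:=True)
    ' Find returns Nothing when there is no match, so test before using .Row
    If Not rngHit Is Nothing Then FindKeyRow = rngHit.Row
End Function
```

A possible use, with a placeholder key and assuming the results sheet is still in scope: `lRow = FindKeyRow(shNewResults, "some key")`, then read `shNewResults.Cells(lRow, 5).Value2` only if `lRow > 0`. Because the lookup goes by key rather than position, it should not matter that the sources list their rows in different orders.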
