Webscrape inside a for loop - Follow up

Question

Following my previous question, Webscrape VBA with condition, I started automating the procedure for a list of URLs from this website, which I prepared in my Excel document. It worked perfectly with 20 and 30 URLs, but when I increased the number, a "Subscript out of range" error occurred at the ReDim in GetNodesTextAsArray. Do you have any idea why? After some research I tried replacing the ReDim with a for loop, but it doesn't change anything.

Public Sub WindInfo()
'VBE> Tools > References:
'1. Microsoft XML, v6.0
'2. Microsoft HTML Object Library
'3. Microsoft Scripting Runtime
Dim xhr As MSXML2.XMLHTTP60: Set xhr = New MSXML2.XMLHTTP60
Dim html As MSHTML.HTMLDocument: Set html = New MSHTML.HTMLDocument
Dim ws As Worksheet: Set ws = ThisWorkbook.Worksheets("Sheet1")
Dim url As String
Dim j As Integer
Dim r As Long


r = 1

For j = 1 To 20

url = Worksheets("List").Cells(j, 1).Value

    With xhr
        .Open "GET", url, False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim generalities As Object, arrGen(), partsList As Object
    
    

    Set generalities = html.querySelectorAll("#bloc_texte table ~ table li")
    arrGen = GetNodesTextAsArray(generalities)
    
    Dim parts As Object, numberOfParts As Long
    
    Set partsList = html.querySelectorAll("h1 ~ h3, ul ~ h3")
    
    
    If partsList.Length > 0 Then
    
        numberOfParts = html.querySelectorAll("h1 ~ h3, ul ~ h3").Length / 2
    
        Set parts = html.querySelectorAll("h3 + ul")
       
        Dim i As Long, liNodes As Object, arr()
        Dim html2 As MSHTML.HTMLDocument: Set html2 = New MSHTML.HTMLDocument
        
        For i = 0 To numberOfParts - 1
            ws.Cells(r, 1).Resize(1, UBound(arrGen)) = arrGen
            html2.body.innerHTML = parts.Item(i).outerHTML & parts.Item(i + numberOfParts).outerHTML
            Set liNodes = html2.querySelectorAll("li")
            arr = GetNodesTextAsArray(liNodes)
            ws.Cells(r, 5).Resize(1, UBound(arr)) = arr
            r = r + 1
        Next
        
    Else
        arr = GetNodesTextAsArray(html.querySelectorAll("#bloc_texte h1 + ul").Item(1).getElementsByTagName("li"))
        ws.Cells(r, 1).Resize(1, UBound(arrGen)) = arrGen
        ws.Cells(r, 5).Resize(1, UBound(arr)) = arr
        r = r + 1
    End If
    Application.Wait (Now + TimeValue("0:00:01"))
Next

End Sub


Public Function GetNodesTextAsArray(ByVal nodeList As Object) As Variant()
Dim i As Long, results()

ReDim results(1 To nodeList.Length)

   

For i = 0 To nodeList.Length - 1
    results(i + 1) = nodeList.Item(i).innerText
Next i
GetNodesTextAsArray = results
End Function

Recommended answer

Nine times out of ten, the problem with VBA web scraping is related to page loading.

So what you want to do is try an error handler that loops while the page loads and then tests whether the element appears on the page.
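One way to read that advice, as a rough sketch rather than the answerer's own code: fetch the page, check that the selector actually matches something, and retry a few times if it does not. `FetchHtml`, `TryGetNodes`, `maxAttempts` and the one-second pause are hypothetical choices made for illustration. The sketch assumes the same references as the question (Microsoft XML, v6.0 and Microsoft HTML Object Library).

```vba
' Hypothetical helper: returns the response text, or "" on any failure.
Private Function FetchHtml(ByVal url As String) As String
    On Error GoTo Failed
    With New MSXML2.XMLHTTP60
        .Open "GET", url, False
        .send
        If .Status = 200 Then FetchHtml = .responseText
    End With
    Exit Function
Failed:
    FetchHtml = vbNullString
End Function

' Hypothetical helper: loads url into html and returns the nodes matching
' selector, or Nothing if no attempt produced a match.
Public Function TryGetNodes(ByVal url As String, ByVal selector As String, _
                            ByVal html As MSHTML.HTMLDocument, _
                            Optional ByVal maxAttempts As Long = 3) As Object
    Dim attempt As Long, body As String, nodes As Object

    For attempt = 1 To maxAttempts
        body = FetchHtml(url)
        If Len(body) > 0 Then
            html.body.innerHTML = body
            Set nodes = html.querySelectorAll(selector)
            If nodes.Length > 0 Then
                Set TryGetNodes = nodes
                Exit Function
            End If
        End If
        ' Pause before the next attempt (assumed delay of one second).
        Application.Wait Now + TimeValue("0:00:01")
    Next attempt

    Set TryGetNodes = Nothing
End Function
```

The caller can then test `If Not nodes Is Nothing Then ...` before reading from the list, and skip or log the URL otherwise.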

I'll leave the code to you, but timing is essentially everything here: even checking the status code and waiting for the document to load does not guarantee that everything you need has loaded.

Even with great tools like Selenium and Puppeteer you still need to check and handle element loading and timing; you can't just assume that because the document has loaded, the element is there.
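Applied to the code in the question, one possible cause (an assumption, not something confirmed in this thread) is that a selector matches nothing on some pages. In that case `nodeList.Length` is 0, and `ReDim results(1 To 0)` raises run-time error 9, "Subscript out of range". A guarded variant of the helper might look like this. `GetNodesTextOrEmpty` is a hypothetical name.

```vba
' Hypothetical variant of GetNodesTextAsArray that tolerates an empty
' or missing node list instead of failing on ReDim.
Public Function GetNodesTextOrEmpty(ByVal nodeList As Object) As Variant
    Dim i As Long, results() As Variant

    ' No node list at all: return a zero-length array (UBound = -1).
    If nodeList Is Nothing Then
        GetNodesTextOrEmpty = Array()
        Exit Function
    End If

    ' Selector matched nothing: same zero-length result.
    If nodeList.Length = 0 Then
        GetNodesTextOrEmpty = Array()
        Exit Function
    End If

    ReDim results(1 To nodeList.Length)
    For i = 0 To nodeList.Length - 1
        results(i + 1) = nodeList.Item(i).innerText
    Next i
    GetNodesTextOrEmpty = results
End Function
```

With the result stored in a plain `Variant`, the caller would check `UBound(arr) >= 1` before calling `Resize(1, UBound(arr))`, since resizing a range to zero columns also fails.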

Does that make sense?
