Web scraping across multiple pages without knowing the last page number

Question

My code crawls the titles of tutorials spread across several pages of a site, and it works flawlessly. I wrote it so that it does not depend on knowing the last page number in the URL; instead, it keeps requesting pages until the HTTP status is no longer 200 (http.Status <> 200). The code below works well in that case. However, when I tried another URL to see whether the loop would stop on its own, the code fetched all the results but never stopped. What is the workaround so that the code stops the macro once it is done? Here is the working version:

Sub WiseOwl()
    ' Requires references: Microsoft XML, v6.0 and Microsoft HTML Object Library
    Const mlink = "http://www.wiseowl.co.uk/videos/default"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object
    Dim x As Long, y As Long   ' x = output row, y = page number

    Do While True
        y = y + 1
        With http
            .Open "GET", mlink & "-" & y & ".htm", False
            .send
            ' A non-200 status means we have run past the last page
            If .Status <> 200 Then
                MsgBox "It's done"
                Exit Sub
            End If
            html.body.innerHTML = .responseText
        End With

        ' Write each series title link to column A
        For Each post In html.getElementsByClassName("woVideoListDefaultSeriesTitle")
            With post.getElementsByTagName("a")
                x = x + 1
                If .Length Then Cells(x, 1) = .item(0).innerText
            End With
        Next post
    Loop
End Sub

I found a way to handle Yellowpage. My updated script can parse Yellowpage, but it breaks before scraping the last page because that page has no "Next Page" button. I tried this: "https://www.dropbox.com/s/iptqm79b0byw3dz/Yellowpage.txt?dl=0"

However, when I applied the same logic to a torrent site, it did not work:

"https://www.yify-torrent.org/genres/western/p-1/"

Recommended answer

You can always rely on whether the elements exist. For example, if the elements are missing and you try to use the object variable you assigned them to, you will get:

Run-time error '91': Object variable or With block variable not set

This error is the signal to look for to end your loop. See the example below:

Sub yify()
    ' Requires references: Microsoft XML, v6.0 and Microsoft HTML Object Library
    Const mlink = "https://www.yify-torrent.org/genres/western/p-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object
    Dim posts As Object
    Dim x As Long, y As Long   ' x = output row, y = page number

    y = 1
    Do
        With http
            .Open "GET", mlink & y & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")
        ' Any error from here on jumps to the label just before the loop test
        On Error GoTo Endofpage
        Debug.Print Len(posts) 'to force Error 91

        For Each post In posts
            With post.getElementsByTagName("div")
                x = x + 1
                If .Length Then Cells(x, 1) = .Item(0).innerText
            End With
        Next post
        y = y + 1
Endofpage:
    Loop Until Err.Number = 91   ' stop once error 91 has been raised
    Debug.Print "It's over"
End Sub
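
As a possible alternative to trapping an error, the loop can test directly whether a page returned any matching elements. The following is only a sketch under assumptions: it reuses the URL and the "mv" class name from the answer, and it assumes that a page past the end either returns a non-200 status or contains no "mv" elements. The names ScrapeUntilEmpty, BASE_URL and MAX_PAGES are hypothetical, and the MAX_PAGES cap is an added safety limit that is not part of the original answer.

```vba
Option Explicit

' Requires references: Microsoft XML, v6.0 and Microsoft HTML Object Library.
Sub ScrapeUntilEmpty()
    Const BASE_URL As String = "https://www.yify-torrent.org/genres/western/p-"
    Const MAX_PAGES As Long = 500   ' hypothetical safety cap, not from the original post

    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim posts As Object, post As Object
    Dim pageNum As Long, rowNum As Long

    For pageNum = 1 To MAX_PAGES
        http.Open "GET", BASE_URL & pageNum & "/", False
        http.send
        ' Stop if the server reports anything other than success
        If http.Status <> 200 Then Exit For

        html.body.innerHTML = http.responseText
        Set posts = html.getElementsByClassName("mv")
        ' Stop when the page contains no result blocks (assumed end of listing)
        If posts.Length = 0 Then Exit For

        For Each post In posts
            With post.getElementsByTagName("div")
                ' Only advance the output row when there is something to write
                If .Length Then
                    rowNum = rowNum + 1
                    Cells(rowNum, 1) = .Item(0).innerText
                End If
            End With
        Next post
    Next pageNum

    ' pageNum is one past the last page that was fully processed
    Debug.Print "Done after page " & pageNum - 1
End Sub
```

Checking both the status code and the element count covers sites that return an error page at the end as well as sites that keep answering with 200 and an empty listing. Unlike the answer's code, this sketch advances the output row only when a "div" was found, so it does not leave blank rows.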
