How to get all the links leading to the next page?

Views: 26 Published: 2021/9/22 20:30:36 Tags: vba web-scraping web-crawler


Problem description

I've written some code in VBA to collect all the pagination links (the links leading to the next pages) from a webpage. The highest page number among those links is 255. When I run my script, I end up with 6906 links in total, which means the loop collects the same links again and again. After filtering out the duplicates, I can see that there are only 254 unique links. My goal is to avoid hardcoding the highest page number when iterating over the pages. Here is what I've tried:

Sub YifyLink()
    Const link = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
    Dim x As Long, y As Long, item_link As String
    Dim post As Object, posts As Object, I As Long

    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(post.innerText, "Last") Then
            x = Split(Split(post.href, "-")(1), "/")(0)
        End If
    Next post
    For y = 0 To x
        item_link = link & "t-" & y & "/"

        With http
            .Open "GET", item_link, False
            .send
            htm.body.innerHTML = .responseText
        End With
        For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
            I = I + 1: Cells(I, 1) = posts.href
        Next posts
    Next y
End Sub

The HTML element that contains the links:

<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>

The results I'm getting (partial):

about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/

Solution

The idea is to scrape the pages in a loop and check some condition on each page; as soon as the condition no longer holds, exit the loop.

That condition could be, for example, checking a key against a dictionary, checking whether an element exists, or any other logic specific to your problem.
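As a minimal sketch of the dictionary check mentioned above: the helper below records each link in a `Scripting.Dictionary` and reports whether it was new. `AddIfNew` and `DemoAddIfNew` are hypothetical names used only for illustration, and the dictionary is created with late binding, which assumes a Windows host where the Scripting runtime is available. A scraping loop could stop once a page contributes no new links.

```vba
' Sketch only: AddIfNew is a hypothetical helper, not part of the original answer.
' Returns True and stores href when it has not been seen before, False otherwise.
Function AddIfNew(ByVal seen As Object, ByVal href As String) As Boolean
    If seen.Exists(href) Then
        AddIfNew = False
    Else
        seen.Add href, True
        AddIfNew = True
    End If
End Function

Sub DemoAddIfNew()
    Dim seen As Object
    ' Late binding, so no reference to Microsoft Scripting Runtime is required.
    Set seen = CreateObject("Scripting.Dictionary")

    Debug.Print AddIfNew(seen, "/search/1080p/t-2/")   ' first occurrence
    Debug.Print AddIfNew(seen, "/search/1080p/t-3/")   ' first occurrence
    Debug.Print AddIfNew(seen, "/search/1080p/t-2/")   ' repeated link
    Debug.Print seen.Count                             ' number of unique links stored
End Sub
```

Writing a link to the sheet only when `AddIfNew` returns True would also avoid the duplicate rows described in the question.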

In your case, the site keeps serving page 255 for any page number beyond it, and that gives us a clue: we can compare an element that belongs to page n with the corresponding element that belongs to page n-1.

For instance, if an element on page 256 is the same as the one on page 255, exit the loop (or the sub). Please see the sample code below:

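Sub yify()
    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1

    Do
        ' Fetch page number pageno synchronously and load it into the HTML document.
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")
        ' Stop when the link text of the 18th "mv" element matches the last value
        ' written to column A, which suggests the site is serving the same page again.
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first <div> of each "mv" element to column A.
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post

        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub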
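The output in the question shows each `href` coming back with an `about:` prefix, because the HTML is parsed in a document that has no base URL. The sketch below shows one possible way to turn such a value into an absolute URL and to read the page number from it. `AbsoluteUrl` and `PageNumberFromHref` are hypothetical helper names, the host is taken from the question's URL, and the code assumes pager links end in `t-<number>/` as in the question's HTML.

```vba
' Sketch only: both helpers are hypothetical and not part of the original answer.

' Strips a leading "about:" and prepends the site host (taken from the question).
Function AbsoluteUrl(ByVal href As String) As String
    Const HOST As String = "https://www.yify-torrent.org"
    If Left$(href, 6) = "about:" Then href = Mid$(href, 7)
    AbsoluteUrl = HOST & href
End Function

' Reads the page number from a link ending in "t-<number>/".
' Returns 0 when the link carries no "t-" marker (for example the first page).
Function PageNumberFromHref(ByVal href As String) As Long
    Dim pos As Long
    pos = InStrRev(href, "t-")                  ' position of the last "t-"
    If pos = 0 Then Exit Function               ' no marker found: result stays 0
    ' Val reads leading digits and stops at the first non-numeric character.
    PageNumberFromHref = CLng(Val(Mid$(href, pos + 2)))
End Function

Sub DemoHrefHelpers()
    Debug.Print AbsoluteUrl("about:/search/1080p/t-20/")
    Debug.Print PageNumberFromHref("about:/search/1080p/t-255/")
End Sub
```

Searching from the end of the string with `InStrRev` avoids splitting on the hyphen in the host name, which a plain `Split(href, "-")` would hit once the URL is absolute.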
