How to get all the links leading to the next page?
Problem description
I've written some VBA code to collect all the pagination links from a webpage. The highest page number among those links is 255. When I run my script, I get 6906 links, which means the loop keeps collecting the same links over and over. After filtering out the duplicates, 254 unique links remain. My goal is to iterate over the pages without hardcoding the highest page number. Here is what I've tried:
Sub YifyLink()
    Const link = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
    Dim x As Long, y As Long, item_link As String

    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(post.innerText, "Last") Then
            x = Split(Split(post.href, "-")(1), "/")(0)
        End If
    Next post

    For y = 0 To x
        item_link = link & "t-" & y & "/"
        With http
            .Open "GET", item_link, False
            .send
            htm.body.innerHTML = .responseText
        End With
        For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
            I = I + 1: Cells(I, 1) = posts.href
        Next posts
    Next y
End Sub
The element that contains the links:
<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>
The results I'm getting (partial):
about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/
The idea is to scrape the pages in a loop and test some condition on each one; as soon as the condition fails, exit the loop.
The test could be, e.g., looking up a key in a dictionary, checking whether an element exists, or any other logic specific to your problem.
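The dictionary check mentioned above could look roughly like this. It is only a minimal sketch and not part of the original answer: `IsNewKey`, `DemoStopOnRepeat`, and the sample keys are hypothetical names, and it assumes each page can be reduced to a single string key (for instance, the text of its first item).

```vba
' Returns True the first time a key is seen and False on any repeat.
' "seen" is a Scripting.Dictionary supplied by the caller (late bound here).
Function IsNewKey(ByVal seen As Object, ByVal key As String) As Boolean
    If seen.Exists(key) Then
        IsNewKey = False
    Else
        seen.Add key, True
        IsNewKey = True
    End If
End Function

Sub DemoStopOnRepeat()
    Dim seen As Object, keys As Variant, i As Long
    Set seen = CreateObject("Scripting.Dictionary")

    ' Hypothetical per-page keys; the last one repeats, as happens when
    ' the site serves the same page again for a higher page number.
    keys = Array("page-1-first-title", "page-2-first-title", "page-2-first-title")

    For i = LBound(keys) To UBound(keys)
        If Not IsNewKey(seen, CStr(keys(i))) Then Exit For   ' repeat found: stop
        Debug.Print "new page key: " & keys(i)
    Next i
End Sub
```

Running `DemoStopOnRepeat` prints the first two keys and stops at the third, which is the same exit condition a scraping loop would use once a page starts repeating.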
For example, the problem here is that the site keeps serving page 255 for every later page number. That gives us a clue: we can compare an element from page n with the corresponding element from page n-1.
For instance, if an element on page 256 is the same as the one on page 255, exit the loop/sub. Please see the sample code below:
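Sub yify()
    ' XMLHTTP60 and HTMLDocument require references to
    ' "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1
    Do
        ' Fetch the current page synchronously and load it into the HTML document
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With
        Set posts = html.getElementsByClassName("mv")

        ' Stop when the last value written to the sheet equals the link text of
        ' posts(17), which is taken to mean this page repeats the previous one
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first div of each "mv" element to column A
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post

        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub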
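A different route, closer to the question's own approach, is to read the page count from the "Last" link once and then build the page URLs directly, rather than collecting the pager links again on every page. The sketch below is an illustration under assumptions, not the answerer's method: `PageNumberFromHref` and `ListPageLinks` are hypothetical names, the href is hardcoded from the sample HTML in the question (in real use it would come from the "Last" anchor), and the URL pattern `t-<n>/` is taken from that same sample.

```vba
' Extracts the page number from an href such as "about:/search/1080p/t-255/".
' Assumes the href ends with "/t-<number>/", as in the question's sample HTML.
Function PageNumberFromHref(ByVal href As String) As Long
    Dim parts() As String
    parts = Split(href, "/t-")
    ' Take the text after the last "/t-" and keep what precedes the next "/"
    PageNumberFromHref = CLng(Split(parts(UBound(parts)), "/")(0))
End Function

Sub ListPageLinks()
    Const baseUrl As String = "https://www.yify-torrent.org/search/1080p/"
    Dim lastPage As Long, y As Long

    ' Hardcoded here for illustration only
    lastPage = PageNumberFromHref("about:/search/1080p/t-255/")

    ' Page 1 is the base URL itself, so the generated links start at t-2
    For y = 2 To lastPage
        Cells(y - 1, 1) = baseUrl & "t-" & y & "/"
    Next y
End Sub
```

With the sample href this writes the links for pages 2 through 255 to column A, one per row, so each link appears exactly once.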