VB.NET: extract links from a Google search using HtmlAgilityPack
Question
I have updated my code as a test: I want to list all URLs that contain index.php, but it also displays other things.
Here is my working code:
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim webClient As New System.Net.WebClient
    Dim WebSource As String = webClient.DownloadString("http://www.google.com/search?lr=&cr=countryCA&newwindow=1&hl=fil&as_qdr=all&biw=1366&bih=667&tbs=ctr%3AcountryCA&q=index.php&oq=index.php&gs_l=serp.12..0l10.520034.522335.0.525032.9.9.0.0.0.0.497.3073.1j1j2j0j5.9.0....0...1c.1.25.serp..5.4.884.J4smY262XgY")
    RichTextBox1.Text = WebSource
    ListBox1.Items.Clear()
    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
    htmlDoc.LoadHtml(WebSource)
    For Each link As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//cite")
        If link.InnerText.Contains("index.php") Then
            ListBox1.Items.Add(link.InnerText)
        End If
    Next
End Sub
The expected output should contain only sites that have index.php in the URL, like this:
http://www.site1.com/index.php
http://www.site2.com/index.php
http://www.site3.com/index.php
http://www.site4.com/index.php
http://www.site5.com/index.php
The problem is that each listed URL stops at index.php; the rest of the link is not included.
For example, the full URL is
http://www.site5.com/index.php?test_test=test&test
but the program shows only
http://www.site5.com/index.php
or it shows the URL broken up by dots, like
http://www.site5.com/index.php...test....test
Recommended answer
I would use Html Agility Pack to extract the links, as below:
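' Requires "Imports HtmlAgilityPack" at the top of the file.
Dim links As New List(Of String)()
Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
htmlDoc.LoadHtml(WebSource)
' SelectNodes returns Nothing when no node matches, so guard before looping.
Dim anchors As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//a[@href]")
If anchors IsNot Nothing Then
    For Each link As HtmlNode In anchors
        Dim att As HtmlAttribute = link.Attributes("href")
        If att.Value.Contains("/forums/") Then
            links.Add(att.Value)
        End If
    Next
End If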
If it is a Google search result page, try something like the following:
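' The <cite> elements hold the URL text displayed under each result.
Dim cites As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//cite")
If cites IsNot Nothing Then
    For Each link As HtmlNode In cites
        If link.InnerText.Contains("index.php") Then
            links.Add(link.InnerText)
        End If
    Next
End If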
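The text inside a cite element is only what the page displays, which may be why the question sees URLs cut off at index.php or broken up by dots. A possible way around this is to read the href attribute of the result anchors instead. The sketch below assumes that each result link on the downloaded page has the form /url?q=<target>&..., which is not confirmed by the question and may not match the markup Google currently serves. The names GoogleLinkSketch, ExtractTargetUrl and ExtractResultLinks are hypothetical helpers introduced here for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module GoogleLinkSketch

    ' Hypothetical helper. Assumes the href looks like "/url?q=<target>&sa=...".
    ' Returns the decoded target URL, or Nothing when the href has another shape.
    Function ExtractTargetUrl(href As String) As String
        If String.IsNullOrEmpty(href) Then Return Nothing
        ' Raw attribute values often carry "&" encoded as "&amp;".
        href = href.Replace("&amp;", "&")
        If Not href.StartsWith("/url?", StringComparison.Ordinal) Then Return Nothing

        Dim query As String = href.Substring("/url?".Length)
        For Each pair As String In query.Split("&"c)
            If pair.StartsWith("q=", StringComparison.Ordinal) Then
                ' Undo percent-encoding, e.g. %3F becomes "?".
                Return Uri.UnescapeDataString(pair.Substring(2))
            End If
        Next
        Return Nothing
    End Function

    ' Hypothetical helper. Collects every decoded target URL that contains keyword.
    Function ExtractResultLinks(html As String, keyword As String) As List(Of String)
        Dim results As New List(Of String)()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no anchor matches.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then Return results

        For Each a As HtmlNode In anchors
            Dim target As String = ExtractTargetUrl(a.GetAttributeValue("href", ""))
            If target IsNot Nothing AndAlso target.Contains(keyword) Then
                results.Add(target)
            End If
        Next
        Return results
    End Function

End Module
```

With the WebSource string from the question, ExtractResultLinks(WebSource, "index.php") would return the full target URLs, query string included, provided the page really uses the assumed /url?q= link format.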