How to extract the full url with HtmlAgilityPack - C#
Question
With the approach below, only the url exactly as written in the href attribute is extracted.
Extraction code:
foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Value is already a string, so no ToString() call is needed
    lsLinks.Add(link.Attributes["href"].Value);
}
HTML code
<a href="Login.aspx">Login</a>
Extracted url
Login.aspx
But I want to get the real link as the browser would resolve it, like:
http://www.monstermmorpg.com/Login.aspx
I could check whether the url contains http and prepend the domain if it does not, but that may cause problems in some cases, and I don't think it is a very wise solution.
c# 4.0 , HtmlAgilityPack.1.4.0
Recommended answer
Assuming you have the original url of the page, you can resolve each parsed url against it, something like this:
// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");
// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'
// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'
// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'
// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'
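The constructor calls above could be wrapped in a small helper that resolves every extracted href in one pass. This is only a minimal sketch: LinkResolver and ResolveLinks are hypothetical names, and dropping non-http(s) results such as javascript: links is an assumption about what a crawler usually wants, not something the question requires.

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolver
{
    // Hypothetical helper: resolves each raw href against the address of the
    // page it was found on and keeps only http/https results.
    public static List<string> ResolveLinks(Uri pageUrl, IEnumerable<string> hrefs)
    {
        var resolved = new List<string>();
        foreach (string href in hrefs)
        {
            if (href == null)
                continue;

            Uri absolute;
            // TryCreate returns false instead of throwing on malformed input.
            if (!Uri.TryCreate(pageUrl, href, out absolute))
                continue;

            // Assumption: skip javascript:, mailto: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
                continue;

            resolved.Add(absolute.AbsoluteUri);
        }
        return resolved;
    }
}
```

In the loop from the question, the raw href values would be collected first and then passed to ResolveLinks together with the Uri of the page that was downloaded.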
Note the use of AbsoluteUri rather than ToString(): ToString decodes the URL (to make it more "human-readable"), which is typically not what you want.
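To see the difference for yourself, a small console program along these lines can help. It is a sketch only: UriTextDemo and the example.com address are made up for illustration, and the exact text ToString() produces may vary with the framework version.

```csharp
using System;

public static class UriTextDemo
{
    public static void Main()
    {
        // Hypothetical address whose path contains a percent-encoded space.
        var url = new Uri("http://example.com/my%20page.aspx");

        // AbsoluteUri keeps the escaped form, which is what you would request.
        Console.WriteLine(url.AbsoluteUri);

        // ToString() returns a display form in which escapes may be decoded.
        Console.WriteLine(url.ToString());
    }
}
```

If the two printed lines differ, the first one is the form to store or request.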