html敏捷包url抓取-获取完整的html链接 [英] html agility pack url scraping-- getting full html link

查看:55
本文介绍了html敏捷包url抓取-获取完整的html链接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用nuget包中的html敏捷包,以便抓取网页以获取页面上的所有网址.代码如下所示.但是,它在输出中返回给我的方式,链接只是实际网站的扩展,而不是完整的URL链接,例如 http://www.foo/bar/foobar.com .我将得到的只是"/foobar".有没有一种方法可以使用下面的代码获取url的完整链接? 谢谢!

Hi I am using html agility pack from the nuget packages in order to scrape a web page to get all of the urls on the page. The code is shown below. However the way it returns to me in the output the links are just extensions of the actual website but not the full url link like http://www.foo/bar/foobar.com. All I will get is "/foobar". Is there a way to get the full links of the url with the code below? Thanks!

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string email)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(email);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(href);
            }
        return list.ToList();
    }

推荐答案

您可以检查HREF值是相对URL还是绝对URL. 将链接加载到 Uri 并测试它是否相对将其转换为绝对值将是方法.

You can check the HREF value if it's relative URL or absolute. Load the link into a Uri and test whether it is relative If it relative convert it to absolute will be the way to go.

static void Main(string[] args)
    {
        List<string> linksToVisit = ParseLinks("https://www.facebook.com");
    }

public static List<string> ParseLinks(string urlToCrawl)
    {

        WebClient webClient = new WebClient();

        byte[] data = webClient.DownloadData(urlToCrawl);
        string download = Encoding.ASCII.GetString(data);

        HashSet<string> list = new HashSet<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(download);
        HtmlNodeCollection nodes =    doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var n in nodes)
            {
                string href = n.Attributes["href"].Value;
                list.Add(GetAbsoluteUrlString(urlToCrawl, href));
            }
        return list.ToList();
    }

将相对URL转换为绝对URL的功能

static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}

这篇关于html敏捷包url抓取-获取完整的html链接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆