Building a web crawler from a Google page


Problem description

How can I get all of the HTML links that exist in an HTML page (for example, a Google results page) after I search for specific words (a query)?

I need to parse the page to do that.

Can anyone help me? I don't know how to begin.

Recommended answer



You must first retrieve the web page HTML text. This is obtained by:

// Requires: using System;
//           using System.IO;
//           using System.Net;
//           using HAP = HtmlAgilityPack;

Uri uri = <desired uri with query string>;
HAP.HtmlDocument web_page = new HAP.HtmlDocument();

try
{
    // Submit the request and parse the response into the HAP document.
    WebRequest web_request = WebRequest.Create(uri.AbsoluteUri);

    using (WebResponse web_response = web_request.GetResponse())
    using (Stream stream = web_response.GetResponseStream())
    {
        web_page.Load(stream);
    }
}
catch (WebException)
{
    // link is broken
    return;
}



其中HAP被声明为使用HAP = HtmlAgilityPack;。如果你不熟悉HtmlAgilityPack,谷歌它并使用NuGet下载它。


where HAP is declared as "using HAP = HtmlAgilityPack;". If you are unfamiliar with the HtmlAgilityPack, Google it and use NuGet to download it.
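For reference, it can be installed from the NuGet Package Manager Console with the following command (the package ID is HtmlAgilityPack):

Install-Package HtmlAgilityPack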



此代码基本上提交请求(包括查询字符串)并加载带有响应的HtmlAgilityPack文档。注意这需要时间来完成。


This code basically submits the request (including a query string) and loads an HtmlAgilityPack document with the response. Note this takes time to accomplish.
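As an illustration of the "<desired uri with query string>" placeholder, a search URI might be assembled like this (the Google URL and its q parameter are assumptions for the example, not part of the original answer):

// Hypothetical example: a Google-style search URL, with the query
// text percent-encoded so it is safe to embed in the URL.
string query = "web crawler";
Uri uri = new Uri("https://www.google.com/search?q="
                  + Uri.EscapeDataString(query));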



然后,使用HtmlAgilityPack的设施,


Then, using the facilities of HtmlAgilityPack,

// **************************************** extract_hyperlinks

void extract_hyperlinks(HAP.HtmlDocument web_page, Uri root)
{
    // Select every <a> element that carries an href attribute;
    // requiring href skips pure name/id anchors.
    HAP.HtmlNodeCollection hyperlinks =
        web_page.DocumentNode.SelectNodes(@"//a[@href]");

    if ((hyperlinks == null) || (hyperlinks.Count == 0))
    {
        return;
    }

    foreach (HAP.HtmlNode hyperlink in hyperlinks)
    {
        HAP.HtmlAttribute attribute = hyperlink.Attributes["href"];

        if (attribute == null)
        {
            continue;
        }

        string destination = attribute.Value;

        // Ignore javascript: pseudo-links (buttons built from <a> tags).
        if (destination.StartsWith("javascript",
                StringComparison.InvariantCultureIgnoreCase))
        {
            continue;
        }

        // Ignore fragment links.
        if (destination.Contains("#"))
        {
            continue;
        }

        // Combine with the root, which also makes a relative href
        // absolute; skip destinations that are not valid URIs.
        Uri next_uri;

        if (!Uri.TryCreate(root, destination, out next_uri))
        {
            continue;
        }

        // Stay within the site rooted at "root".
        if (!root.IsBaseOf(next_uri))
        {
            continue;
        }

        /////////// next_uri contains a hyperlink in the HTML document
    }
}



基本上,这段代码首先将具有href属性的所有超链接分配给变量超链接。请注意,要求href属性会消除所有名称和ID锚点。然后,对于每个超链接,该方法进行测试以确保其目标不是片段或JavaScript引用。目标将转换为URI(next_uri),然后根据需要转换为绝对URI。如果next_uri不是根URI的子节点,则忽略它。最后,next_uri在文档中包含一个超链接。


Basically, this code first assigns all hyperlinks with a href attribute to the variable hyperlinks. Note that requiring a href attribute eliminates all name and id anchors. Then for each hyperlink, the method tests to insure that its destination is not a fragment or a JavaScript reference. The destination is converted to a URI (next_uri) and then, if necessary, converted to an absolute URI. If next_uri is not a child of the root URI, it is ignored. At the end, next_uri contains a hyperlink in the document.
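For completeness, here is a minimal sketch of a call site that ties the two pieces together. The starting URL is a hypothetical placeholder, and the retrieval code repeats the first block:

// Minimal sketch: fetch one page and walk its hyperlinks.
// "https://www.example.com/" is a hypothetical starting point.
Uri root = new Uri("https://www.example.com/");
HAP.HtmlDocument web_page = new HAP.HtmlDocument();

using (WebResponse web_response =
           WebRequest.Create(root.AbsoluteUri).GetResponse())
using (Stream stream = web_response.GetResponseStream())
{
    web_page.Load(stream);              // parse the fetched HTML
}

extract_hyperlinks(web_page, root);     // enumerate the page's links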



我正在写作一篇介绍网站Mapper的文章。当它发布时你可能会感兴趣。


I am in the midst of writing an article that presents a Web Site Mapper. When it's published you may find it of interest.

