Building a web crawler from a Google page


Problem description

How can I get all of the HTML links that exist in an HTML page (for example, a Google results page) after I search for specific words (a query)?

I need to parse the page to do that.

Can anyone help me? I don't know how to begin.

Recommended answer



You must first retrieve the web page HTML text. This is obtained by:

// Requires: using System;
//           using System.IO;
//           using System.Net;
//           using HAP = HtmlAgilityPack;

Uri uri = <desired uri with query string>;
HAP.HtmlDocument web_page = new HAP.HtmlDocument();

try
{
    // Submit the request and parse the response into the HAP document.
    WebRequest web_request = WebRequest.Create(uri.AbsoluteUri);

    using (WebResponse web_response = web_request.GetResponse())
    using (Stream stream = web_response.GetResponseStream())
    {
        web_page.Load(stream);
    }
}
catch (WebException)
{
    // link is broken
    return;
}



其中HAP被声明为使用HAP = HtmlAgilityPack;。如果你不熟悉HtmlAgilityPack,谷歌它并使用NuGet下载它。


where HAP is declared as "using HAP = HtmlAgilityPack;". If you are unfamiliar with the HtmlAgilityPack, Google it and use NuGet to download it.
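For reference, it can be installed from the NuGet Package Manager Console with the following command (the package ID is HtmlAgilityPack):

Install-Package HtmlAgilityPack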



此代码基本上提交请求(包括查询字符串)并加载带有响应的HtmlAgilityPack文档。注意这需要时间来完成。


This code basically submits the request (including a query string) and loads an HtmlAgilityPack document with the response. Note this takes time to accomplish.
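As an illustration of the "<desired uri with query string>" placeholder, a search URI might be assembled like this (the Google URL and its q parameter are assumptions for the example, not part of the original answer):

// Hypothetical example: a Google-style search URL, with the query
// text percent-encoded so it is safe to embed in the URL.
string query = "web crawler";
Uri uri = new Uri("https://www.google.com/search?q="
                  + Uri.EscapeDataString(query));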



然后,使用HtmlAgilityPack的设施,


Then, using the facilities of HtmlAgilityPack,

// **************************************** extract_hyperlinks

void extract_hyperlinks(HAP.HtmlDocument web_page, Uri root)
{
    // Select every <a> element that carries an href attribute;
    // requiring href skips pure name/id anchors.
    HAP.HtmlNodeCollection hyperlinks =
        web_page.DocumentNode.SelectNodes(@"//a[@href]");

    if ((hyperlinks == null) || (hyperlinks.Count == 0))
    {
        return;
    }

    foreach (HAP.HtmlNode hyperlink in hyperlinks)
    {
        HAP.HtmlAttribute attribute = hyperlink.Attributes["href"];

        if (attribute == null)
        {
            continue;
        }

        string destination = attribute.Value;

        // Ignore javascript: pseudo-links (buttons built from <a> tags).
        if (destination.StartsWith("javascript",
                StringComparison.InvariantCultureIgnoreCase))
        {
            continue;
        }

        // Ignore fragment links.
        if (destination.Contains("#"))
        {
            continue;
        }

        // Combine with the root, which also makes a relative href
        // absolute; skip destinations that are not valid URIs.
        Uri next_uri;

        if (!Uri.TryCreate(root, destination, out next_uri))
        {
            continue;
        }

        // Stay within the site rooted at "root".
        if (!root.IsBaseOf(next_uri))
        {
            continue;
        }

        /////////// next_uri contains a hyperlink in the HTML document
    }
}



基本上,这段代码首先将具有href属性的所有超链接分配给变量超链接。请注意,要求href属性会消除所有名称和ID锚点。然后,对于每个超链接,该方法进行测试以确保其目标不是片段或JavaScript引用。目标将转换为URI(next_uri),然后根据需要转换为绝对URI。如果next_uri不是根URI的子节点,则忽略它。最后,next_uri在文档中包含一个超链接。


Basically, this code first assigns all hyperlinks with a href attribute to the variable hyperlinks. Note that requiring a href attribute eliminates all name and id anchors. Then for each hyperlink, the method tests to insure that its destination is not a fragment or a JavaScript reference. The destination is converted to a URI (next_uri) and then, if necessary, converted to an absolute URI. If next_uri is not a child of the root URI, it is ignored. At the end, next_uri contains a hyperlink in the document.
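For completeness, here is a minimal sketch of a call site that ties the two pieces together. The starting URL is a hypothetical placeholder, and the retrieval code repeats the first block:

// Minimal sketch: fetch one page and walk its hyperlinks.
// "https://www.example.com/" is a hypothetical starting point.
Uri root = new Uri("https://www.example.com/");
HAP.HtmlDocument web_page = new HAP.HtmlDocument();

using (WebResponse web_response =
           WebRequest.Create(root.AbsoluteUri).GetResponse())
using (Stream stream = web_response.GetResponseStream())
{
    web_page.Load(stream);              // parse the fetched HTML
}

extract_hyperlinks(web_page, root);     // enumerate the page's links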



我正在写作一篇介绍网站Mapper的文章。当它发布时你可能会感兴趣。


I am in the midst of writing an article that presents a Web Site Mapper. When it's published you may find it of interest.

