Building a web crawler from a Google page
Question
How can I get all of the HTML links that exist in an HTML page (such as a Google results page) after I search for concrete words (a query)?
I need to parse the page to do that.
Can anyone help me?
I don't know how I should begin.
Answer
You must first retrieve the web page HTML text. This is obtained by:
// requires: using System.IO; using System.Net;

Uri uri = <desired uri with query string>;
HAP.HtmlDocument web_page = new HAP.HtmlDocument ( );

try
{
    WebRequest web_request;
    WebResponse web_response;

    web_request = WebRequest.Create ( uri.AbsoluteUri );
    web_response = web_request.GetResponse ( );
    using ( Stream stream = web_response.GetResponseStream ( ) )
    {
        web_page.Load ( stream );
    }
}
catch ( WebException we )
{
    // link is broken
    return;
}
where HAP is declared as "using HAP = HtmlAgilityPack;". If you are unfamiliar with the HtmlAgilityPack, Google it and use NuGet to download it.
This code submits the request (including the query string) and loads an HtmlAgilityPack document with the response. Note that this is a blocking network call and takes time to complete.
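The snippet above leaves the query URI as a placeholder. As a minimal sketch of how such a URI might be built, the hypothetical helper below percent-encodes the search words and appends them as a "q" query-string parameter (the host and parameter name follow Google's public search form; any site with a query-string search works the same way). Be aware that Google may block or CAPTCHA-challenge automated requests, so a supported search API is the safer route for real crawling.

```csharp
using System;

static class SearchUriBuilder
{
    // Hypothetical helper: builds a search URI from query words.
    public static Uri BuildSearchUri ( string query )
    {
        // Percent-encode the query so spaces and symbols are legal
        // inside a URI (e.g. "concrete words" -> "concrete%20words").
        string escaped = Uri.EscapeDataString ( query );
        return new Uri ( "https://www.google.com/search?q=" + escaped );
    }
}
```

The resulting Uri can then be fed directly to the retrieval code above.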
Then, using the facilities of HtmlAgilityPack:
// **************************************** extract_hyperlinks
void extract_hyperlinks ( HAP.HtmlDocument web_page,
                          Uri root )
{
    HAP.HtmlNodeCollection hyperlinks;

    hyperlinks = web_page.DocumentNode.SelectNodes ( @"//a[@href]" );
    if ( ( hyperlinks == null ) || ( hyperlinks.Count == 0 ) )
    {
        return;
    }
    foreach ( HAP.HtmlNode hyperlink in hyperlinks )
    {
        HAP.HtmlAttribute attribute;
        string destination;
        Uri next_uri;

        attribute = hyperlink.Attributes [ "href" ];
        if ( attribute == null )
        {
            continue;
        }
        destination = attribute.Value;
        if ( destination.StartsWith (
                 "javascript",
                 StringComparison.InvariantCultureIgnoreCase ) )
        {
            // ignore javascript on buttons using <a> tags
            continue;
        }
        if ( destination.IndexOf ( "#" ) >= 0 )
        {
            // ignore fragments
            continue;
        }
        // Resolve the href against the root. The two-argument Uri
        // constructor throws on malformed input rather than
        // returning null, so Uri.TryCreate is used instead; the
        // result of base + relative resolution is always absolute.
        if ( !Uri.TryCreate ( root, destination, out next_uri ) )
        {
            continue;
        }
        if ( !root.IsBaseOf ( next_uri ) )
        {
            continue;
        }
        // next_uri contains a hyperlink in the HTML document
    }
}
Basically, this code first assigns all hyperlinks with an href attribute to the variable hyperlinks. Note that requiring an href attribute eliminates all name and id anchors. Then, for each hyperlink, the method tests to ensure that its destination is not a fragment or a JavaScript reference. The destination is resolved against the root into an absolute URI (next_uri). If next_uri is not a child of the root URI, it is ignored. At the end of the loop body, next_uri contains a hyperlink in the document.
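The filtering steps just described can be isolated into a standalone predicate, which makes them easy to test without HtmlAgilityPack or a network connection. The sketch below assumes only System.Uri; IsCrawlableLink is a hypothetical helper name mirroring the checks inside extract_hyperlinks:

```csharp
using System;

static class LinkFilter
{
    // Hypothetical helper: returns true only for href values that
    // extract_hyperlinks would keep as crawlable links.
    public static bool IsCrawlableLink ( Uri root, string destination )
    {
        if ( string.IsNullOrEmpty ( destination ) )
        {
            return false;
        }
        // ignore javascript: pseudo-links used on buttons
        if ( destination.StartsWith (
                 "javascript",
                 StringComparison.InvariantCultureIgnoreCase ) )
        {
            return false;
        }
        // ignore fragments (#section anchors)
        if ( destination.IndexOf ( "#" ) >= 0 )
        {
            return false;
        }
        // resolve relative hrefs against the root; reject malformed ones
        Uri next_uri;
        if ( !Uri.TryCreate ( root, destination, out next_uri ) )
        {
            return false;
        }
        // stay within the root site
        return root.IsBaseOf ( next_uri );
    }
}
```

For example, with root "https://www.example.com/", the predicate accepts "/about" but rejects "javascript:void(0)", "/docs#intro", and links to other hosts.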
I am in the midst of writing an article that presents a Web Site Mapper. When it's published, you may find it of interest.