HTMLAgilityPack: loading AJAX content for scraping

Problem description

I'm trying to scrape a web page using HTMLAgilityPack in a C# WebForms project.

All the solutions I've seen for this use a WebBrowser control. However, as far as I can tell, that control is only available in WinForms projects.

At present I'm loading the required page with this code:

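// HtmlWeb downloads the page and parses the HTML exactly as the server returned it.
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlAgilityPack.HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");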

Here is an example I've seen that uses the WebBrowser control:

if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);

Any suggestions or pointers on how to grab the page once the AJAX content has loaded would be appreciated.

Recommended answer

It seems that HTMLAgilityPack can only scrape content that is present in the HTML returned by the server. It parses markup but does not execute JavaScript, so anything loaded afterwards via AJAX is not visible to it.
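
To illustrate the point, here is a minimal sketch that parses a made-up page whose "nav" div is empty in the markup and would only be filled by a script in a real browser. The class name, the sample markup, and the `loadNav()` script call are all hypothetical; only `HtmlDocument.LoadHtml` and `SelectNodes` come from HTMLAgilityPack.

```csharp
using HtmlAgilityPack;

public static class StaticHtmlDemo
{
    // Hypothetical page: the div is empty in the served markup and would
    // only be populated once a browser ran the script.
    public const string SampleHtml =
        "<html><body><div class=\"nav\"></div>" +
        "<script>loadNav();</script>" +
        "</body></html>";

    // Counts the links inside the nav div as HTMLAgilityPack sees them.
    public static int CountNavLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection links =
            document.DocumentNode.SelectNodes("//div[@class='nav']//a");
        return links == null ? 0 : links.Count;
    }
}
```

Calling `StaticHtmlDemo.CountNavLinks(StaticHtmlDemo.SampleHtml)` returns 0, because the script is never run and the div stays empty.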

Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug (or the network panel of the browser's developer tools) to find the request that supplies the AJAX-loaded data, and then work with that source data directly. An added advantage may be the ability to scrape a larger dataset.
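
Once the request has been identified, it can be called directly from C# without any browser control. The sketch below is one possible shape, not a definitive implementation. It assumes the endpoint returns a JSON array of objects with a string "title" property, and that `System.Text.Json` is available (it is built into .NET Core 3.0 and later, and is a NuGet package otherwise). The class name, method names, and the "title" property are hypothetical.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class AjaxSourceScraper
{
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the raw response body of the endpoint found in the network panel.
    public static async Task<string> FetchAsync(string endpointUrl)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, endpointUrl))
        {
            // Some endpoints only answer requests that look like XHR calls.
            request.Headers.Add("X-Requested-With", "XMLHttpRequest");
            using (HttpResponseMessage response = await Client.SendAsync(request))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }

    // Assumed payload shape: [ { "title": "..." }, ... ]
    public static List<string> ParseTitles(string json)
    {
        var titles = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            foreach (JsonElement item in doc.RootElement.EnumerateArray())
            {
                // Skip entries that are not objects or lack a string title.
                if (item.ValueKind == JsonValueKind.Object &&
                    item.TryGetProperty("title", out JsonElement title) &&
                    title.ValueKind == JsonValueKind.String)
                {
                    titles.Add(title.GetString());
                }
            }
        }
        return titles;
    }
}
```

A call would look something like `var titles = AjaxSourceScraper.ParseTitles(await AjaxSourceScraper.FetchAsync(url));`, where `url` is whatever address the network panel showed. If the endpoint returns an HTML fragment rather than JSON, the response string can be passed to `HtmlDocument.LoadHtml` and queried with XPath as in the question.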
