Scraping using Html Agility Pack
Problem description
I am trying to scrape data from a news article using HtmlAgilityPack. The link is as follows: http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528
I have written the code below to extract all the comments in this article, but for some reason my variable aTags is returning null.
code:
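// Download and parse the page at the URL entered by the user.
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(txtinputurl.Text);
// SelectNodes returns null (not an empty list) when nothing matches the XPath.
var aTags = document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
int counter = 1;
if (aTags != null)
{
    foreach (var aTag in aTags)
    {
        // Append one numbered line per comment.
        lbloutput.Text += counter + ". " + aTag.InnerHtml + "\t" + "<br />";
        counter++;
    }
}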
I have also tried this XPath, but got the same result:
//div[@class='newcomment_list']/ul/li/div[@class='headerwrap']/div[@class='com_user_text']
Please help me with the correct XPath to extract all the comments. I have searched all over the net but found no solution.
Recommended answer
Do a 'View Source' on the page and search for com_user_text. The user comments don't appear at all. They are loaded via JavaScript after the page is loaded, so when you load the page content via getHtmlWeb.Load(), you don't get the user comments. No XPath can match them, which is why SelectNodes returns null.
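The 'View Source' check can also be done in code before spending time on XPath. The following is only a minimal sketch using the .NET standard library: the class name StaticHtmlCheck and both method names are hypothetical, introduced here for illustration. It fetches the raw HTML that the server sends (no JavaScript is run) and reports whether a marker such as com_user_text occurs in it.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper (not part of HtmlAgilityPack): checks whether a marker,
// such as a CSS class name, is present in the HTML the server actually sends.
public static class StaticHtmlCheck
{
    // True when the marker occurs anywhere in the raw HTML (case-insensitive).
    public static bool ContainsMarker(string rawHtml, string marker)
    {
        if (string.IsNullOrEmpty(rawHtml) || string.IsNullOrEmpty(marker))
        {
            return false;
        }
        return rawHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Downloads the page without executing any JavaScript, then runs the check.
    public static async Task<bool> UrlContainsMarkerAsync(string url, string marker)
    {
        using (var client = new HttpClient())
        {
            string rawHtml = await client.GetStringAsync(url);
            return ContainsMarker(rawHtml, marker);
        }
    }
}
```

If UrlContainsMarkerAsync(url, "com_user_text") returns false, the comments are most likely injected by script after the page loads, and a plain HTTP download will not contain them.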
As this answer says, Html Agility Pack is not a tool capable of emulating a browser and running JavaScript. Instead, you need something like WatiN, which "allows programmatic access to web pages through a given browser engine and will load the full document."
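Once a browser-driven tool has produced the fully rendered HTML, the original XPath can still be applied with HtmlAgilityPack. The sketch below covers only that second step and assumes the rendered markup is already available as a string. How it is obtained (WatiN or any other browser automation tool) is deliberately left out, and the class name RenderedCommentExtractor and method name ExtractComments are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: parses HTML that a real browser engine has already rendered.
public static class RenderedCommentExtractor
{
    // Returns the trimmed text of every div whose class is exactly com_user_text.
    public static List<string> ExtractComments(string renderedHtml)
    {
        var comments = new List<string>();

        var document = new HtmlDocument();
        document.LoadHtml(renderedHtml ?? string.Empty);

        // SelectNodes returns null when there is no match, so guard before looping.
        var nodes = document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
        if (nodes == null)
        {
            return comments;
        }

        foreach (var node in nodes)
        {
            comments.Add(node.InnerText.Trim());
        }
        return comments;
    }
}
```

Returning an empty list instead of null keeps the calling code free of null checks, whether or not the page contains any comments.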