Scraping using Html Agility Pack

Question

I am trying to scrape data from a news article using HtmlAgilityPack. The link is as follows: http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528

I have written the code below to extract all the comments in this article, but for some reason my variable aTags comes back null.

Code:

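// Download and parse the page (requires: using HtmlAgilityPack;)
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(txtinputurl.Text);

// SelectNodes returns null when no node matches the XPath
var aTags = document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
int counter = 1;
if (aTags != null)
{
    foreach (var aTag in aTags)
    {
        // Append one numbered line per comment; the original appended
        // lbloutput.Text to itself, duplicating earlier output on each pass
        lbloutput.Text += counter + ". " + aTag.InnerHtml + "<br />";
        counter++;
    }
}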

I have also tried this XPath, with the same result:

//div[@class='newcomment_list']/ul/li/div[@class='headerwrap']/div[@class='com_user_text']

Please help me with the correct XPath to extract all the comments. I have searched all over the net but found no solution.

Recommended answer

Do a 'View Source' on the page and search for com_user_text. The user comments don't appear at all: they are loaded via JavaScript after the page has loaded. So when you fetch the page content via getHtmlWeb.Load(), you get only the initial HTML without the user comments, and SelectNodes returns null because nothing matches the XPath.
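The 'View Source' check can also be done in code. The following is a minimal sketch, not the asker's actual page: the two HTML strings are made-up stand-ins for what the server sends and for what a browser shows after scripts have run, and CommentProbe and CountCommentNodes are hypothetical names introduced only for illustration. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class CommentProbe
{
    // Counts <div class="com_user_text"> nodes in the given markup.
    public static int CountCommentNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='com_user_text']");
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Hypothetical markup as served: the comment container is empty
        // and a script is expected to fill it in later.
        string served =
            "<html><body><div id='comments'></div>" +
            "<script src='comments.js'></script></body></html>";

        // Hypothetical markup after a browser has run the script.
        string rendered =
            "<html><body><div id='comments'>" +
            "<div class='com_user_text'>Nice article</div></div></body></html>";

        Console.WriteLine(CountCommentNodes(served));   // prints 0
        Console.WriteLine(CountCommentNodes(rendered)); // prints 1
    }
}
```

If the count is 0 for the HTML that HtmlWeb.Load() returns but the comments are visible in the browser, the XPath is probably not the problem; the nodes are simply not in the downloaded document.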

As this answer says, Html Agility Pack is not a tool capable of emulating a browser and running JavaScript. Instead, you need something like WatiN that "allows programmatic access to web pages through a given browser engine and will load the full document."
