Cannot Crawl Web Page Data Using ScrapySharp


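Problem description

Hi all,

I am facing a technical issue. I read several articles looking for an answer, but none of the sites I found had a proper one.

I am using ScrapySharp to crawl web page data for my project. The issue appears when I try to crawl data from http://edition.cnn.com/POLITICS.

First, I loaded the page in IE and opened the Developer Tools to inspect the tags. From there I picked the element I need, which gave me the XPath "//div[@class='cd__content']" for my code.

I then load the same page through ScrapySharp: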

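ScrapingBrowser browser = new ScrapingBrowser();
// NavigateToPageAsync returns a Task<WebPage>, so it has to be awaited (inside an async method).
WebPage rootPage = await browser.NavigateToPageAsync(new Uri(url));
// SelectNodes returns null, not an empty collection, when nothing matches.
HtmlNodeCollection rootNodes = rootPage.Html.SelectNodes("//div[@class='cd__content']");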

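rootNodes comes back as null.

Digging deeper, I found that the cd__content elements sit inside a "SECTION" tag, and that "SECTION" tag is empty when the page is loaded programmatically.

When I inspect the page in IE or Chrome, all the tags are filled in, which is why I was able to pick the element there. When I load the page from code, they are not.

My question: how can I load the page with all of its content filled in using ScrapySharp?

Experts, please help.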






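Solution

The page is probably not a static page that contains all of the information. Instead, the browser builds it dynamically by running the code (JavaScript) included in the page. At the moment you only get the page's initial state. To build the full page, you need an engine that executes the scripts.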


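Check whether the third-party library is able to do this (ask its authors and the associated community). You may also need to wait in a loop, re-analysing the data until the content appears.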
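
The "wait in a loop" idea can be sketched as a small polling helper. This is only a minimal sketch and not part of ScrapySharp: `Poller`, `WaitFor`, and the `probe` delegate are hypothetical names. It assumes that `probe` re-reads the page through something that actually executes the page's scripts. Re-downloading the same raw HTML would presumably return the same empty SECTION every time.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of ScrapySharp): polls until a probe yields a result.
public static class Poller
{
    // Calls probe() repeatedly until it returns a non-null value or the timeout elapses.
    // Returns that value, or null when the content never appeared in time.
    public static T WaitFor<T>(Func<T> probe, TimeSpan timeout, TimeSpan interval) where T : class
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            T result = probe();
            if (result != null)
            {
                return result;
            }
            if (DateTime.UtcNow >= deadline)
            {
                return null;
            }
            Thread.Sleep(interval);
        }
    }
}
```

A probe built on the question's code could be a lambda that runs `SelectNodes("//div[@class='cd__content']")` on a freshly rendered page, since `SelectNodes` returns null until something matches.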


By the way, also consider the official way of getting the information in XML format:
http://edition.cnn.com/services/rss/
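
Reading a feed sidesteps the script problem because RSS is plain XML. Below is a minimal sketch using `System.Xml.Linq` from the .NET base library. `RssReader` and `ReadItems` are hypothetical names, and it assumes the feed is standard RSS 2.0 with `<item>` entries. The URL above looks like an index page, so the address of a concrete feed would have to be taken from it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: extracts (title, link) pairs from RSS 2.0 XML text.
public static class RssReader
{
    public static List<KeyValuePair<string, string>> ReadItems(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        // RSS 2.0 keeps its entries in <item> elements under <channel>.
        // Casting a missing element to string yields null, so fall back to "".
        return doc.Descendants("item")
            .Select(item => new KeyValuePair<string, string>(
                (string)item.Element("title") ?? "",
                (string)item.Element("link") ?? ""))
            .ToList();
    }
}
```

To read straight from the network instead of from a string, `XDocument.Load` accepts a URI, so `XDocument.Load(feedUrl)` could replace the `Parse` call once a real feed address is known.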






