Scraping a webpage generated by JavaScript with C#


Question

I have a WebBrowser control and a label in a Visual Studio project, and what I'm trying to do is grab a section from another webpage.

I tried using WebClient.DownloadString and WebClient.DownloadFile, and both of them give me the source code of the webpage before the JavaScript loads the content. My next idea was to use a WebBrowser control and read webBrowser.DocumentText after the page loaded, but that did not work either: it still gives me the original source of the page.

Is there a way I can grab the page after the JavaScript has loaded?

Here is the page I want to scrape:

http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083

I need to get the comment off of that page, which is generated by the page's JavaScript.

Recommended answer

The problem is that a browser executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. When I ran into the same issue in the past, I used Selenium and PhantomJS to render the page. After the page rendered, I used the WebDriver client to navigate the DOM and retrieve the content I needed, post-AJAX.

At a high level, these are the steps:


  1. Installed Selenium: http://docs.seleniumhq.org/
  2. Started the Selenium hub as a service
  3. Downloaded PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
  4. Started PhantomJS in WebDriver mode, pointing to the Selenium hub
  5. In my scraping application, installed the WebDriver client NuGet package: Install-Package Selenium.WebDriver

Here is an example usage of the PhantomJS WebDriver:

using System;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

// Connect to the Selenium hub; Configuration.SeleniumServerHub holds its URL
var driver = new RemoteWebDriver(new Uri(Configuration.SeleniumServerHub),
                                 options.ToCapabilities(),
                                 TimeSpan.FromSeconds(3));
// Load the page (the driver will execute its scripts)
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// The driver can now provide you with what you need
// Get the source of the rendered page
var source = driver.PageSource;
// Fully navigate the DOM
var pathElement = driver.FindElementById("some-id");
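
Because the content arrives via AJAX, the element may not exist yet at the moment the page load returns. A minimal sketch of one way to handle that is to poll until the element appears. It assumes the Selenium.Support NuGet package is installed (it provides WebDriverWait); the RenderedPage class and WaitForElement helper are hypothetical names for illustration, and "some-id" is the same placeholder id used above.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI; // assumption: Selenium.Support package is installed

static class RenderedPage
{
    // Polls until an element with the given id is present in the DOM.
    // Throws WebDriverTimeoutException if it does not appear within the timeout.
    public static IWebElement WaitForElement(IWebDriver driver, string id, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);
        // FindElement throws NoSuchElementException while the element is missing;
        // state explicitly that the wait should keep polling in that case.
        wait.IgnoreExceptionTypes(typeof(NoSuchElementException));
        return wait.Until(d => d.FindElement(By.Id(id)));
    }
}
```

With the driver from the example above, a call might look like `var comment = RenderedPage.WaitForElement(driver, "some-id", TimeSpan.FromSeconds(10));`, after which `comment.Text` holds the element's rendered text. The ten-second timeout is an arbitrary choice.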



More info on Selenium, PhantomJS and WebDriver can be found at the following links:

http://docs.seleniumhq.org/

http://docs.seleniumhq.org/projects/webdriver/

http://phantomjs.org/

Edit: a simpler approach

It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):

Install the web driver:

Install-Package Selenium.WebDriver

Install the embedded exe:

Install-Package phantomjs.exe

Updated code:

using OpenQA.Selenium.PhantomJS;

var driver = new PhantomJSDriver();
// Load the page (the driver will execute its scripts)
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// The driver can now provide you with what you need
// Get the source of the rendered page
var source = driver.PageSource;
// Fully navigate the DOM
var pathElement = driver.FindElementById("some-id");
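
When scraping many pages this way, each PhantomJSDriver starts its own phantomjs.exe process, so it is worth releasing the driver when done. The following is a sketch only, assuming the two NuGet packages above are installed and that disposing the driver ends the browser session; the Program class is illustrative and not part of the original answer.

```csharp
using System;
using OpenQA.Selenium.PhantomJS;

class Program
{
    static void Main()
    {
        // The using block disposes the driver, which should end the session
        // and stop the phantomjs.exe process that the driver launched.
        using (var driver = new PhantomJSDriver())
        {
            driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
            // Report how much rendered markup came back.
            Console.WriteLine(driver.PageSource.Length);
        }
    }
}
```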
