Scraping a webpage generated by JavaScript with C#


Question

I have a web browser control and a label in a Visual Studio project, and basically what I'm trying to do is grab a section of another webpage.

I tried WebClient.DownloadString and WebClient.DownloadFile, and both give me the source of the web page as it is before the JavaScript loads the content. My next idea was to use the web browser control and read webBrowser.DocumentText after the page loaded, but that did not work either: it still returns the original source of the page.

Is there a way to grab the page as it is after the JavaScript has loaded its content?

Recommended answer

The problem is that a browser executes the page's JavaScript, which updates the DOM, while a plain HTTP download only returns the original markup. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past and used Selenium and PhantomJS to render the page. Once the page was rendered, I used the WebDriver client to navigate the DOM and retrieve the content I needed, after the AJAX calls had completed.
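If you can intercept the data the script uses, a browser may not be needed at all. The following is a minimal sketch of that alternative, assuming the page's JavaScript fetches a JSON document from an endpoint you have found in the browser's network tab. The class name, the method names and the "title" field are hypothetical, and System.Text.Json is assumed to be available (it is built into .NET Core 3.0 and later).

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class JsonEndpointScraper
{
    // Returns the string value of a top-level field, or null when the
    // payload is not a JSON object or the field is missing / not a string.
    // The field name depends on the real endpoint; it is an assumption here.
    public static string ExtractField(string json, string fieldName)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object)
            {
                return null;
            }

            if (root.TryGetProperty(fieldName, out JsonElement value)
                && value.ValueKind == JsonValueKind.String)
            {
                return value.GetString();
            }

            return null;
        }
    }

    // Downloads the payload the page's script would have requested and
    // extracts one field from it. endpointUrl is whatever URL you observed.
    public static async Task<string> FetchFieldAsync(
        HttpClient client, string endpointUrl, string fieldName)
    {
        string json = await client.GetStringAsync(endpointUrl);
        return ExtractField(json, fieldName);
    }
}
```

For example, JsonEndpointScraper.ExtractField("{\"title\":\"Hello\"}", "title") returns "Hello". When no such endpoint can be found, or the data is assembled by the script itself, rendering the page in a real browser remains the fallback.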

In summary, the steps were:

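  1. Installed Selenium: http://docs.seleniumhq.org/
  2. Started the Selenium hub as a service
  3. Downloaded PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
  4. Started PhantomJS in WebDriver mode, pointing to the Selenium hub
  5. Installed the WebDriver client NuGet package in my scraping application: Install-Package Selenium.WebDriver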

Here is an example usage of the PhantomJS WebDriver:

using System;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

// connect to the Selenium hub; Configuration.SeleniumServerHub holds its URL
var driver = new RemoteWebDriver(new Uri(Configuration.SeleniumServerHub),
                    options.ToCapabilities(),
                    TimeSpan.FromSeconds(3)
                  );
// load the page; the driver executes its scripts as a browser would
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// the driver can now provide you with what you need
// get the source of the rendered page
var source = driver.PageSource;
// or navigate the DOM directly
var pathElement = driver.FindElementById("some-id");
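Content loaded by AJAX may not be in the DOM yet at the moment navigation returns, so an element lookup made straight away can fail. The sketch below shows one way to wait for it, using Selenium's WebDriverWait. The class and method names are hypothetical, and depending on the Selenium version WebDriverWait may require the separate Selenium.Support NuGet package.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class AjaxWait
{
    // Polls the page until the element identified by locator exists,
    // then returns it. Throws WebDriverTimeoutException if the element
    // has not appeared when the timeout elapses.
    public static IWebElement WaitForElement(
        IWebDriver driver, By locator, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);

        // WebDriverWait ignores "not found" exceptions while polling,
        // so FindElement is simply retried until it succeeds.
        return wait.Until(d => d.FindElement(locator));
    }
}
```

With the driver from the example above, a call such as AjaxWait.WaitForElement(driver, By.Id("some-id"), TimeSpan.FromSeconds(10)) would return the element once the script has inserted it. The ten-second timeout is an arbitrary choice.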

More info on Selenium, PhantomJS and WebDriver can be found at the following links:

http://docs.seleniumhq.org/

http://docs.seleniumhq.org/projects/webdriver/

http://phantomjs.org/

An easier way

It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do large-scale scraping in this manner):

Install the web driver:

Install-Package Selenium.WebDriver

Install the embedded exe:

Install-Package phantomjs.exe

Updated code:

using OpenQA.Selenium.PhantomJS;

// starts the phantomjs.exe installed by the NuGet package; no hub required
var driver = new PhantomJSDriver();
// load the page; the driver executes its scripts as a browser would
driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");
// the driver can now provide you with what you need
// get the source of the rendered page
var source = driver.PageSource;
// or navigate the DOM directly
var pathElement = driver.FindElementById("some-id");
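Each driver instance controls a browser session (with PhantomJSDriver, a phantomjs.exe process), so it is worth shutting it down when you are done, especially when scraping many pages. A minimal sketch of one way to do that follows. RenderedPage, GetSource and the driverFactory parameter are hypothetical names introduced for illustration.

```csharp
using System;
using OpenQA.Selenium;

public static class RenderedPage
{
    // Creates a driver, loads the URL, returns the rendered source and
    // always disposes the driver, which ends the browser session.
    // driverFactory is a hypothetical hook so any IWebDriver can be used.
    public static string GetSource(Func<IWebDriver> driverFactory, string url)
    {
        using (IWebDriver driver = driverFactory())
        {
            driver.Navigate().GoToUrl(url);
            return driver.PageSource;
        }
    }
}
```

With the package setup above, this could be called as RenderedPage.GetSource(() => new PhantomJSDriver(), url). Taking a factory rather than a concrete driver type keeps the helper usable with other WebDriver implementations.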
