C#: scrape the correct web content after jQuery has run

Question

I've been using HtmlAgilityPack for a while, but the web resource I'm working with now seems to run a jQuery step in the browser before the real content appears. I expect a product page to load, but what actually loads (verified with both a WebBrowser control and WebClient.DownloadString) is a redirect page asking the visitor to select a consultant and sign up with them.

In other words, Chrome's Inspect >> Elements tool shows:

<div data-v-1a7a6550="" class="product-extra-images">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_1MainImage-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">
  <img data-v-1a7a6550="" src="https://vw-xerophane.storage.googleapis.com:443/thumbnails/products/10174_2Image2-White-9-14_1.jpg.100x100_q85_crop_upscale.jpg" width="50">

But WebBrowser and HtmlAgilityPack only get:

<div class="container content">
  <div class="alert alert-danger " role="alert">
    <button type="button" class="close" data-dismiss="alert">
      <span aria-hidden="true">&times;</span>
    </button>
    <h2 style="text-align: center; background: none; padding-bottom: 0;">It looks like you haven't selected a Consultant yet!</h2>
    <p style="text-align: center;"><span>...were you just wanting to browse or were you looking to shop and pick a Consultant to shop under?</span></p>
      <div class="text-center">
        <form action="/just-browsing/" method="POST" class="form-inline">
   ...

After digging into the class definitions in the head, I found that the page does use jQuery to handle proper loading and to handle actions (scrolling, resizing, hovering over images, selecting other images, etc.) while the visitor browses the page. Here is the header comment of the jQuery file:

/*!
* jQuery JavaScript Library v2.1.4
* http://jquery.com/
*
* Includes Sizzle.js
* http://sizzlejs.com/
*
* Copyright 2005, 2014 jQuery Foundation, Inc. and other contributors
* Released under the MIT license
* http://jquery.org/license
*
* Date: 2015-04-28T16:01Z
*/

I tried ScrapySharp as described in "C# .NET: Scraping dynamic (JS) websites", but that just ended up consuming all available memory and never produced anything.

I also tried the approach from "htmlagilitypack and dynamic content issue", which loaded the incorrect redirect page noted above.

I can provide more of the source I'm trying to extract from, including the complete jQuery, if needed.

Recommended answer

Set CaptureRedirect = false; to bypass the redirect page. This worked for me with the page you mentioned:

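using System;           // TimeSpan
using HtmlAgilityPack;  // HtmlWeb

var web = new HtmlWeb();
web.CaptureRedirect = false;                    // turn off redirect capture
web.BrowserTimeout = TimeSpan.FromSeconds(15);  // browser timeout of 15 seconds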

Now keep checking until the text "Product Description" appears on the page:

var doc = web.LoadFromBrowser(url, html =>
{
    return html.Contains("Product Description");
});

Recent versions of HtmlAgilityPack can run a browser in the background, so a separate library such as ScrapySharp is not really needed for scraping dynamic content.
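
Once the rendered document is available, the thumbnails shown in the question can be pulled out with an XPath query. The following is only a minimal sketch, not part of the original answer: the class name product-extra-images comes from the question's markup, while the ThumbnailExtractor class, the GetImageUrls helper, and the example.com URLs are hypothetical names introduced for illustration. To keep the example runnable without network access, it parses an inline HTML string as a stand-in for the document returned by LoadFromBrowser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ThumbnailExtractor
{
    // Hypothetical helper: returns the src attribute of every <img>
    // found inside a div whose class contains "product-extra-images".
    public static List<string> GetImageUrls(HtmlDocument doc)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(@class, 'product-extra-images')]//img[@src]");
        if (nodes == null)
            return new List<string>();

        return nodes
            .Select(n => n.GetAttributeValue("src", string.Empty))
            .ToList();
    }

    static void Main()
    {
        // Stand-in for the document returned by web.LoadFromBrowser(...);
        // the markup mirrors the question, the URLs are placeholders.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class=\"product-extra-images\">" +
            "<img src=\"https://example.com/a.jpg\" width=\"50\">" +
            "<img src=\"https://example.com/b.jpg\" width=\"50\">" +
            "</div>");

        foreach (var url in GetImageUrls(doc))
            Console.WriteLine(url);
    }
}
```

In real use, the doc returned by LoadFromBrowser would be passed to GetImageUrls instead of the inline string. The null check matters because an unmatched XPath is the usual symptom of still having received the redirect page rather than the product page.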
