Screen scraping web page after delay


Problem description

I'm trying to scrape a web page using C#. However, after the page loads, it executes some JavaScript that loads more elements into the DOM, and those are the elements I need to scrape. A standard scraper simply grabs the HTML of the page on load and doesn't pick up the DOM changes made via JavaScript. How can I add some way to wait a second or two and then grab the source?

Here is my current code:

// Requires: using System; using System.IO; using System.Net; using System.Text;
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    // Create the request and let the framework negotiate and undo
    // gzip/deflate compression instead of wrapping the stream by hand.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.AutomaticDecompression =
        DecompressionMethods.GZip | DecompressionMethods.Deflate;
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    // The using blocks dispose the response, stream, and reader
    // even when an exception is thrown, so no try/finally is needed.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream responseStream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(responseStream, Encoding.Default))
    {
        // Read the HTML exactly as the server sent it; anything that
        // JavaScript adds to the DOM afterwards is not included.
        return reader.ReadToEnd();
    }
}

Here is an example URL:

http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4

When the page first loads, it says 134 listings found; about a second later, it says 187 properties found.

Recommended answer

To execute the JavaScript, I render the page with WebKit, the engine used by Safari (Chrome's Blink engine is a fork of it). Here is an example using its Python bindings.

WebKit also has .NET bindings, but I haven't used them.
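
Whichever rendering engine ends up executing the JavaScript, the "wait a second or two, then grab the source" step from the question can be sketched as a polling loop rather than a fixed sleep. The following is a minimal C# sketch, not code from the original answer. The class `RenderedPagePoller`, the method `WaitForHtml`, and the `getCurrentHtml` delegate are hypothetical names introduced for illustration; `getCurrentHtml` stands in for whatever call your browser-engine binding offers to serialize the current DOM.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class RenderedPagePoller
{
    // Repeatedly asks the rendering engine for its current HTML until
    // isReady accepts it, or throws TimeoutException once timeout has elapsed.
    //
    // getCurrentHtml: hypothetical delegate wrapping the engine-specific call
    //                 that returns the DOM as it is right now (an assumption).
    // isReady:        caller-supplied check for the JavaScript-added content.
    public static string WaitForHtml(
        Func<string> getCurrentHtml,
        Func<string, bool> isReady,
        TimeSpan timeout,
        TimeSpan pollInterval)
    {
        if (getCurrentHtml == null) throw new ArgumentNullException(nameof(getCurrentHtml));
        if (isReady == null) throw new ArgumentNullException(nameof(isReady));

        Stopwatch stopwatch = Stopwatch.StartNew();
        while (true)
        {
            // Take a fresh snapshot of the DOM on every iteration.
            string html = getCurrentHtml();
            if (html != null && isReady(html))
                return html;

            // Give up once the time budget is spent.
            if (stopwatch.Elapsed >= timeout)
                throw new TimeoutException("The expected content did not appear in time.");

            // Pause before the next snapshot so the loop does not spin.
            Thread.Sleep(pollInterval);
        }
    }
}
```

For the example URL above, `isReady` could be something like `html => html.Contains("properties found")`, with a timeout of a few seconds and a poll interval of a few hundred milliseconds. Note that this sketch blocks the calling thread, so it assumes the engine renders on another thread or in a separate process; an engine that needs the caller's message loop to run its JavaScript would need a timer or event-based wait instead.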
