Screen scraping web page after delay

Views: 24 Published: 2021/12/17 14:08:17 Tags: c# c#-4.0 screen-scraping web-scraping


Problem description

I'm trying to scrape a web page using C#. After the page loads, however, it executes some JavaScript that adds more elements to the DOM, and those are the elements I need to scrape. A standard scraper simply grabs the HTML of the page as delivered and doesn't pick up DOM changes made via JavaScript. How can I add some functionality that waits a second or two and then grabs the source?

Here is my current code:

// Namespaces this method needs.
using System;
using System.IO;
using System.Net;
using System.Text;

private string ScrapeWebpage(string url, DateTime? updateDate)
{
    // Create the request and let the framework negotiate and decode
    // gzip/deflate compression.
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.AutomaticDecompression =
        DecompressionMethods.GZip | DecompressionMethods.Deflate;
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    // Get the response and read the HTML. The using blocks dispose the
    // response, stream, and reader even if an exception is thrown.
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var responseStream = response.GetResponseStream())
    using (var reader = new StreamReader(responseStream, Encoding.Default))
    {
        // This is the HTML as served; no JavaScript has run against it.
        return reader.ReadToEnd();
    }
}

Here is a sample URL:

http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4

You'll see that when the page first loads it says 134 listings found, then after a second it says 187 properties found.
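One detail of the sample URL may be relevant here. The part after `#` is a fragment, and HTTP clients do not send fragments to the server, so a plain request can only ever fetch the base page. The sketch below uses `System.Uri` to split the URL. The class name `FragmentDemo` and the helper `RequestedPart` are hypothetical names for illustration, and whether this site uses the fragment to drive the second load is an assumption, not something confirmed above.

```csharp
using System;

public static class FragmentDemo
{
    // Hypothetical helper: returns the URL up to and including the query
    // string, i.e. everything except the client-side fragment.
    public static string RequestedPart(string url)
    {
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Query);
    }

    public static void Main()
    {
        var url = "http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4";
        var uri = new Uri(url);

        // The fragment stays on the client and is only visible to scripts
        // running in a browser.
        Console.WriteLine("Fragment:  " + uri.Fragment);

        // This is the portion a plain HTTP request can actually ask for.
        Console.WriteLine("Requested: " + RequestedPart(url));
    }
}
```

If the extra listings do depend on the fragment, that would be consistent with the served HTML showing one count and the rendered page showing another.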
