Screen scraping web page after delay

Problem description

I'm trying to scrape a web page using C#. However, after the page loads, it executes some JavaScript that loads more elements into the DOM, and those are the elements I need to scrape. A standard scraper simply grabs the HTML of the page on load and doesn't pick up the DOM changes made via JavaScript. How do I add some way to wait for a second or two and then grab the source?

Here is my current code:

// Requires: using System; using System.IO; using System.IO.Compression;
//           using System.Net; using System.Text;
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    // Create the request (which supports HTTP compression).
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    // Get the response. The using blocks dispose everything, even if an exception is thrown.
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();
        string contentEncoding = response.ContentEncoding.ToLower();
        if (contentEncoding.Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (contentEncoding.Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        // Read the HTML. Disposing the reader also disposes the underlying stream.
        using (var reader = new StreamReader(responseStream, Encoding.Default))
        {
            return reader.ReadToEnd();
        }
    }
}

Here is a sample URL:

http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4

When the page first loads it says 134 listings found; then, after a second, it says 187 properties found.

Recommended answer

To execute the JavaScript, I render the page with WebKit, the engine used by Safari (and by Chrome until it switched to the Blink fork in 2013). I have done this through its Python bindings.

WebKit also has .NET bindings, but I haven't used them.
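
Whichever engine renders the page, a fixed one-or-two-second sleep can be replaced by polling until the DOM stops changing. The sketch below is not from the original answer; it is a minimal illustration in Python (the language of the bindings mentioned above), and `wait_for_stable_source`, `get_source`, and all parameter names are hypothetical. It assumes `get_source` is a callable you supply that returns the current serialized DOM from the browser engine binding in use.

```python
import time
from typing import Callable


def wait_for_stable_source(
    get_source: Callable[[], str],
    settle_time: float = 2.0,
    timeout: float = 10.0,
    poll_interval: float = 0.25,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the page source once it has stopped changing.

    get_source    -- hypothetical callable returning the current DOM as a string
    settle_time   -- seconds the source must stay unchanged before it is accepted
    timeout       -- overall limit in seconds; the latest snapshot is returned when hit
    poll_interval -- seconds to wait between snapshots
    clock, sleep  -- injectable so the helper can be exercised without real delays
    """
    start = clock()
    last = get_source()
    last_change = start
    while True:
        now = clock()
        # Accept the snapshot if it has been stable long enough, or give up on timeout.
        if now - last_change >= settle_time or now - start >= timeout:
            return last
        sleep(poll_interval)
        current = get_source()
        if current != last:
            # The scripts modified the DOM again, so restart the settle timer.
            last = current
            last_change = clock()
```

For the sample page in the question, `settle_time` would need to be longer than the roughly one-second gap before the listing count changes; otherwise the helper could accept the initial snapshot. Polling trades a slightly longer wait on quiet pages for not having to guess a fixed delay.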
