Mass Downloading of Webpages in C#


Problem Description

My application requires that I download a large number of webpages into memory for further parsing and processing. What is the fastest way to do this? My current method (shown below) seems to be too slow and occasionally results in timeouts.

for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();

    try
    {    
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}

Solution

The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.

I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.

So you have to implement a type of "politeness policy" that inserts a delay between successive requests to the same site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, base your delay on how long the site takes to respond, and scale it with the response time: if you can download a page from the site in 500 milliseconds, set your delay to some base value X; if it takes a full second, set your delay to 2X. You can probably cap the delay at 60 seconds (unless crawl-delay is longer), and I would recommend a minimum delay of 5 to 10 seconds.
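As a rough sketch of that rule (the 10x multiplier and the parameter names are illustrative choices of mine; the answer only specifies the proportional scaling, the 60 second cap, and the 5 to 10 second minimum):

static TimeSpan ComputePolitenessDelay(TimeSpan responseTime, TimeSpan? crawlDelay)
{
    // A robots.txt crawl-delay entry always wins when present.
    if (crawlDelay.HasValue)
        return crawlDelay.Value;

    // Otherwise scale the delay with the observed response time.
    // With a 10x multiplier, a 500 ms response yields a 5 second delay
    // and a 1 second response yields 10 seconds.
    var delay = TimeSpan.FromMilliseconds(responseTime.TotalMilliseconds * 10);

    // Clamp to the recommended bounds: at least 5 seconds, at most 60.
    if (delay < TimeSpan.FromSeconds(5)) delay = TimeSpan.FromSeconds(5);
    if (delay > TimeSpan.FromSeconds(60)) delay = TimeSpan.FromSeconds(60);
    return delay;
}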

I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:

// Create a queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize the queue with some number of WebClient instances

// Now process the URLs
foreach (var url in urls_to_download)
{
    // Take blocks until a client is available, which throttles concurrency
    var worker = ClientQueue.Take();
    // Pass the url as the userToken so the completed handler
    // knows which page the result belongs to
    worker.DownloadStringAsync(new Uri(url), url);
}

When you initialize the WebClient instances that go into the queue, attach a handler to each one's DownloadStringCompleted event. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
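For concreteness, here is a minimal sketch of that initialization and handler wiring. The pool size of 10 and the SavePage helper are illustrative assumptions, not part of the original answer:

// Fill the pool; 10 workers matches the concurrency discussed below.
for (int i = 0; i < 10; i++)
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) =>
    {
        var finished = (WebClient)sender;
        if (e.Error == null && !e.Cancelled)
        {
            var url = (string)e.UserState;  // the userToken from DownloadStringAsync
            SavePage(url, e.Result);        // hypothetical helper: write the page to disk
        }
        // Return the client to the queue so the download loop can reuse it.
        ClientQueue.Add(finished);
    };
    ClientQueue.Add(client);
}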

In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
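One related detail, which is standard .NET behavior rather than something from the original answer: the framework defaults to two concurrent HTTP connections per host, so reaching 10 to 15 concurrent connections to a single site requires raising that limit before starting the downloads:

// The default is 2 concurrent connections per host (System.Net).
ServicePointManager.DefaultConnectionLimit = 20;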

That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.

I should also note that there is a huge difference in resource usage between these two blocks of code:

WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
    MyWebClient.DownloadString(url);
}

---------------

foreach (var url in urls_to_download)
{
    WebClient MyWebClient = new WebClient();
    MyWebClient.DownloadString(url);
}

The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
