Best practices for parallelizing a web crawler in .NET 4.0
Question
I need to download a lot of pages through proxies. What is best practice for building a multi-threaded web crawler?
Is Parallel.For/ForEach good enough, or is it better suited for heavy CPU-bound tasks?
What do you say about the following code?
var multyProxy = new MultyProxy();
multyProxy.LoadProxyList();

Task[] taskArray = new Task[1000];
for (int i = 0; i < taskArray.Length; i++)
{
    taskArray[i] = new Task(obj =>
        {
            multyProxy.GetPage((string)obj);
        },
        (object)"http://google.com");
    taskArray[i].Start();
}
Task.WaitAll(taskArray);
It works horribly. It's very slow and I don't know why.
This code also doesn't work well:
System.Threading.Tasks.Parallel.For(0, 1000,
    new System.Threading.Tasks.ParallelOptions() { MaxDegreeOfParallelism = 30 },
    loop =>
    {
        multyProxy.GetPage("http://google.com");
    });
Well, I think I'm doing something wrong.
When I start my script, it uses the network at only 2%-4%.
Answer
You are basically using up CPU-bound threads for IO-bound tasks - i.e. even though you're parallelizing your operations, each one still ties up a ThreadPool thread, which is mainly intended for CPU-bound work.
Basically, you need to use an async pattern for downloading the data, so that it uses IO completion ports - if you're using WebRequest, that means the BeginGetResponse() and EndGetResponse() methods.
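For illustration, a minimal sketch of that Begin/End pattern might look like the following (the Downloader class, the DownloadPageAsync name, and the callback shape are my own additions, not from the question):

```csharp
using System;
using System.IO;
using System.Net;

static class Downloader
{
    // Starts the request and returns immediately; the callback runs on an
    // IO completion port thread when the response arrives, so no ThreadPool
    // thread is blocked while waiting on the network.
    public static void DownloadPageAsync(string url, Action<string> onComplete)
    {
        var req = WebRequest.Create(url);
        req.BeginGetResponse(ar =>
        {
            using (var rsp = req.EndGetResponse(ar))
            using (var reader = new StreamReader(rsp.GetResponseStream()))
            {
                onComplete(reader.ReadToEnd());
            }
        }, null);
    }
}
```

In a real crawler you would also want error handling around EndGetResponse and some throttling, so you don't try to open all 1000 connections at once.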
I would suggest looking at Reactive Extensions to do this, e.g.:
IEnumerable<string> urls = ... get your urls here ...;
var results = from url in urls.ToObservable()
              let req = WebRequest.Create(url)
              from rsp in Observable.FromAsyncPattern<WebResponse>(
                  req.BeginGetResponse, req.EndGetResponse)()
              select ExtractResponse(rsp);
where ExtractResponse probably just uses StreamReader.ReadToEnd to get the string result, if that's what you're after.
You can also look at using the .Retry operator, which will easily let you retry a few times if you run into connection issues, etc.
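As a rough sketch of where .Retry would fit (wrapping each request in Observable.Defer is my own addition, so that a retry re-creates the WebRequest instead of reusing the already-failed one):

```csharp
var results = from url in urls.ToObservable()
              from rsp in Observable.Defer(() =>
                      {
                          // A fresh WebRequest per subscription, so each
                          // retry attempt gets its own request object.
                          var req = WebRequest.Create(url);
                          return Observable.FromAsyncPattern<WebResponse>(
                              req.BeginGetResponse, req.EndGetResponse)();
                      })
                      .Retry(3) // up to 3 attempts per url on error
              select ExtractResponse(rsp);
```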