Parallel request to scrape multiple pages of a website

Problem Description

I want to scrape a website with plenty of pages of interesting data, but as the source is very large I want to multithread and avoid overloading it. I use Parallel.ForEach to start each chunk of 10 tasks, and in the main for loop I wait until the number of active threads drops below a threshold. For that I keep a counter of active threads, incremented when starting a new thread with a WebClient and decremented when the WebClient's DownloadStringCompleted event fires.

Originally the question was how to use DownloadStringTaskAsync instead of DownloadString and wait until each of the threads started in the Parallel.ForEach has completed. This has been solved with a workaround: a counter (activeThreads) and a Thread.Sleep in the main for loop.

Is using await DownloadStringTaskAsync instead of DownloadString supposed to improve speed at all, by freeing a thread while waiting for the DownloadString data to arrive?

And to get back to the original question: is there a way to do this more elegantly using the TPL, without the workaround of a counter?

private static volatile int activeThreads = 0;

public static void RecordData()
{
  var groupSize = 10;
  var source = db.ListOfUrls; // thousands of urls
  var iterations = source.Length / groupSize;
  for (int i = 0; i < iterations; i++)
  {
    var subList = source.Skip(groupSize * i).Take(groupSize);
    // Parallel.ForEach does not await the Task returned by RecordUri (fire-and-forget)
    Parallel.ForEach(subList, (item) => RecordUri(item));
    // I want to wait here before processing further data, to avoid overload
    while (activeThreads > 30) Thread.Sleep(100);
  }
}

private static async Task RecordUri(Uri uri)
{
  using (WebClient wc = new WebClient())
  {
    Interlocked.Increment(ref activeThreads);
    wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);
    var jsonData = await wc.DownloadStringTaskAsync(uri);
    var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
    RecordData(root);
  }
}
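
For comparison, the TPL itself can express this without the counter: throttle with a SemaphoreSlim and await everything with Task.WhenAll. The following is a minimal sketch, not code from the question; RecordAllAsync and the limit of 10 are illustrative, and it reuses the RecordUri method above:

// Sketch only. Requires System.Linq, System.Threading and System.Threading.Tasks.
private static async Task RecordAllAsync(IEnumerable<Uri> uris)
{
  using (var throttle = new SemaphoreSlim(10)) // at most 10 downloads in flight
  {
    var tasks = uris.Select(async uri =>
    {
      await throttle.WaitAsync();   // wait for a free slot
      try { await RecordUri(uri); } // the returned Task is actually awaited here
      finally { throttle.Release(); }
    }).ToList();                    // materialize so every task is started
    await Task.WhenAll(tasks);      // completes when all downloads are done
  }
}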

Recommended Answer

If you want an elegant solution you should use Microsoft's Reactive Framework. It's dead simple:

var source = db.ListOfUrls; // Thousands of urls

var query =
    from uri in source.ToObservable()
    from jsonData in Observable.Using(
        () => new WebClient(),
        wc => Observable.FromAsync(() => wc.DownloadStringTaskAsync(uri)))
    select new { uri, json = JsonConvert.DeserializeObject<RootObject>(jsonData) };

IDisposable subscription =
    query.Subscribe(x =>
    {
        /* Do something with x.uri && x.json */
    });

That's the entire code. It's nicely multi-threaded and it's kept under control.

Just NuGet "System.Reactive" to get the bits.
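
One caveat: as written, the query starts a download for every URL as fast as the source produces them. If you also need to cap concurrency (the original goal), a common Rx pattern is to project each URL into a deferred inner observable and merge with a maximum concurrency. A sketch under that assumption, using a cap of 10 and the same per-URL shape as above:

var throttled =
    source.ToObservable()
          .Select(uri => Observable.Using(
              () => new WebClient(),
              wc => Observable.FromAsync(() => wc.DownloadStringTaskAsync(uri))
                              .Select(jsonData => new { uri, json = JsonConvert.DeserializeObject<RootObject>(jsonData) })))
          .Merge(10); // at most 10 requests in flight at any time

IDisposable subscription =
    throttled.Subscribe(x =>
    {
        /* Do something with x.uri && x.json */
    });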
