Parallel request to scrape multiple pages of a website
Question
I want to scrape a website with plenty of pages containing interesting data, but as the source is very large I want to multithread and limit the load.

I use a Parallel.ForEach to start each chunk of 10 tasks, and in the main for loop I wait until the number of active threads drops below a threshold. For that I keep a counter of active threads, which I increment when starting a new thread with a WebClient and decrement when the WebClient's DownloadStringCompleted event is triggered.
Originally the question was how to use DownloadStringTaskAsync instead of DownloadString and wait until each of the threads started in the Parallel.ForEach has completed. This has been solved with a workaround: a counter (activeThreads) and a Thread.Sleep in the main for loop.
Is using await DownloadStringTaskAsync instead of DownloadString supposed to improve speed at all, by freeing a thread while waiting for the DownloadString data to arrive?
And to get back to the original question, is there a way to do this more elegantly using TPL, without the workaround of a counter?
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;                  // start the work in chunks of 10 tasks
    var source = db.ListOfUrls;          // thousands of urls
    var iterations = source.Length / groupSize;

    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, item => RecordUri(item));

        // I want to wait here before processing further data, to avoid overload.
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);

        // Decrement the active-thread counter once the download has finished.
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);

        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root); // store the parsed result (overload not shown here)
    }
}
Answer
If you want an elegant solution you should use Microsoft's Reactive Framework. It's dead simple:
var source = db.ListOfUrls; // Thousands urls

var query =
    from uri in source.ToObservable()
    from jsonData in Observable.Using(
        () => new WebClient(),
        wc => Observable.FromAsync(() => wc.DownloadStringTaskAsync(uri)))
    select new { uri, json = JsonConvert.DeserializeObject<RootObject>(jsonData) };

IDisposable subscription =
    query.Subscribe(x =>
    {
        /* Do something with x.uri && x.json */
    });
That's the entire code. It's nicely multi-threaded and it's kept under control.
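One thing worth checking: as far as I can tell, the query expression above (a SelectMany) starts a download for every URL as soon as it is emitted, so nothing explicitly caps how many requests run at once. If you want the hard limit the question asked for, a possible variation, offered only as a sketch (the limit of 10 is an example value mirroring the question's chunk size, not something from the original answer), is to build one observable per URL and merge them with a bounded Merge:

// Sketch: cap the number of simultaneous downloads with the Merge(maxConcurrent) overload.
// The value 10 is only an example, mirroring the chunk size used in the question.
var throttledQuery =
    source.ToObservable()
          .Select(uri =>
              Observable.Using(
                  () => new WebClient(),
                  wc => Observable.FromAsync(() => wc.DownloadStringTaskAsync(uri))
                                  .Select(jsonData => new
                                  {
                                      uri,
                                      json = JsonConvert.DeserializeObject<RootObject>(jsonData)
                                  })))
          .Merge(10); // at most 10 downloads in flight at a time

Each element has the same shape as in the original query, so the Subscribe call above works unchanged.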
Just NuGet "System.Reactive" to get the bits.
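And if the calling code needs to block or await until every page has been processed (the original RecordData throttles by waiting in a loop), one hedged option is to await the sequence instead of keeping an IDisposable subscription:

// Usage sketch: ForEachAsync runs the handler for each result and returns a
// Task that completes only when the whole sequence has finished.
await query.ForEachAsync(x =>
{
    /* Do something with x.uri && x.json */
});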