C# .NET 4.5 async / multithread?

Question

I'm writing a C# console application that scrapes data from web pages.

This application will go to about 8000 web pages and scrape data (the data has the same format on each page).

I have it working right now with no async methods and no multithreading.

However, I need it to be faster. It only uses about 3%-6% of the CPU, I think because it spends most of its time waiting to download the HTML (WebClient.DownloadString(url)).

This is the basic flow of my program:

DataSet allData;

foreach(var url in the8000urls)
{
    // ScrapeData downloads the html from the url with WebClient.DownloadString
    // and scrapes the data into several datatables which it returns as a dataset.
    DataSet dataForOnePage = ScrapeData(url);

    //merge each table in dataForOnePage into allData
}

// PushAllDataToSql(allData);

I've been trying to multithread this but am not sure how to properly get started. I'm using .NET 4.5, and my understanding is that async and await in 4.5 are meant to make this much easier to program, but I'm still a little lost.

My idea was to just keep making new threads that are async for this line:

DataSet dataForOnePage = ScrapeData(url);

and then, as each one finishes, run

//merge each table in dataForOnePage into allData

Can anyone point me in the right direction on how to make that line async in .NET 4.5 C# and then have my merge method run on completion?

Thanks.

Here is my ScrapeData method:

public static DataSet GetProperyData(CookieAwareWebClient webClient, string pageid)
{
    var dsPageData = new DataSet();

    // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
    string url = @"https://domain.com?&id=" + pageid + @"restofurl";
    string html = webClient.DownloadString(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData
    return dsPageData;
}

Recommended answer

If you want to use the async and await keywords (you don't have to, but they do make things easier in .NET 4.5), you would first change your ScrapeData method to return a Task<T> instance using the async keyword, like so:

async Task<DataSet> ScrapeDataAsync(Uri url)
{
    // Create the HttpClientHandler which will handle cookies.
    var handler = new HttpClientHandler();

    // Set cookies on handler.

    // Await on an async call to fetch here, convert to a data
    // set and return.
    var client = new HttpClient(handler);

    // Wait for the HttpResponseMessage.
    HttpResponseMessage response = await client.GetAsync(url);

    // Get the content, await on the string content.
    string content = await response.Content.ReadAsStringAsync();

    // Process content variable here into a data set and return.
    DataSet ds = ...;

    // Return the DataSet, it will return Task<DataSet>.
    return ds;
}

Note that you'll probably want to move away from the WebClient class. Its original asynchronous operations follow the event-based pattern (DownloadStringAsync plus a completed event) rather than returning Task<T>. .NET 4.5 does add Task-returning methods such as DownloadStringTaskAsync, but HttpClient was built around Task<T> from the start and is the better choice in .NET 4.5, which is why I've used it above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property, which you'll use to send cookies with each request.
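As a rough illustration of the CookieContainer point, here is a minimal sketch of wiring cookies into an HttpClientHandler. The class name ScraperHttp, the method names, and the cookie name and value passed in are all hypothetical; the real cookies depend on what the site issues.

```csharp
using System;
using System.Net;
using System.Net.Http;

static class ScraperHttp
{
    // Builds a container holding one cookie scoped to the given site.
    // The name and value are supplied by the caller (hypothetical here).
    public static CookieContainer BuildCookies(Uri site, string name, string value)
    {
        var cookies = new CookieContainer();
        cookies.Add(site, new Cookie(name, value));
        return cookies;
    }

    // Creates an HttpClient whose handler sends the container's cookies
    // with every matching request and stores cookies set by responses.
    public static HttpClient CreateClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true
        };
        return new HttpClient(handler);
    }
}
```

A client built this way can be created once and passed to every scrape call, rather than constructing a new HttpClient per page; sharing one instance is the commonly recommended usage.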

Marking the method async means that, inside it, you will most likely use the await keyword to wait on another asynchronous operation, which in this case is the download of the page. You'll have to change the calls that download data to use their asynchronous versions and await those.

Once that is complete, you would normally just await the call to ScrapeDataAsync, but you don't want to do that here. You are running a loop, and awaiting inside it would make each iteration wait for the previous page to finish, so the pages would still be fetched one at a time. In this case, it's better to start each operation without awaiting it and store the returned Task<T> in a list, like so:

DataSet allData = ...;

var tasks = new List<Task<DataSet>>();

foreach(var url in the8000urls)
{
    // ScrapeDataAsync starts downloading the html for the url and
    // returns a Task<DataSet> right away, without waiting for the
    // download or the parsing to finish.
    tasks.Add(ScrapeDataAsync(url));
}

There is also the matter of merging the data into allData. To that end, you want to call the ContinueWith method on the Task<T> instance returned and, in the continuation, add the data to allData:

DataSet allData = ...;

// ContinueWith returns a plain Task here, so the list holds Task
// rather than Task<DataSet>.
var tasks = new List<Task>();

foreach(var url in the8000urls)
{
    // ScrapeDataAsync starts the download and parsing; the
    // continuation runs once the DataSet for that page is ready.
    tasks.Add(ScrapeDataAsync(url).ContinueWith(t => {
        // Lock access to the data set, since continuations
        // can run concurrently on different threads.
        lock (allData)
        {
             // Add the data, e.g. allData.Merge(t.Result);
        }
    }));
}

Then, you can wait on all the tasks using the WhenAll method on the Task class and await that:

// After your loop.
await Task.WhenAll(tasks);

// Process allData

However, note that you have a foreach, and WhenAll takes an IEnumerable<T> implementation. This is a good indicator that the loop is a candidate for LINQ, which it is:

DataSet allData = ...;

var tasks = 
    from url in the8000urls
    select ScrapeDataAsync(url).ContinueWith(t => {
        // Lock access to the data set, since continuations
        // can run concurrently on different threads.
        lock (allData)
        {
             // Add the data.
        }
    });

await Task.WhenAll(tasks);

// Process allData

You can also choose not to use query syntax if you wish; it doesn't matter in this case.
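For comparison, the same query in method syntax might look like the sketch below. To keep it runnable without a network, ScrapeDataAsync is replaced by a hypothetical stand-in that fabricates a one-row DataSet, and the class name, method names, and the "Pages" table are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

static class MethodSyntaxSketch
{
    // Hypothetical stand-in for ScrapeDataAsync: returns a DataSet with
    // one "Pages" row per url instead of downloading anything.
    static async Task<DataSet> ScrapeDataAsync(Uri url)
    {
        await Task.Yield();
        var ds = new DataSet();
        DataTable table = ds.Tables.Add("Pages");
        table.Columns.Add("Url", typeof(string));
        table.Rows.Add(url.ToString());
        return ds;
    }

    public static async Task<DataSet> ScrapeAllAsync(IEnumerable<Uri> urls)
    {
        var allData = new DataSet();

        // Select replaces the from/select query. ToList() runs the query
        // once, so each scrape is started exactly one time.
        List<Task> tasks = urls
            .Select(url => ScrapeDataAsync(url).ContinueWith(t =>
            {
                // Continuations may run concurrently, so guard the merge.
                lock (allData)
                {
                    allData.Merge(t.Result);
                }
            }))
            .ToList();

        await Task.WhenAll(tasks);
        return allData;
    }
}
```

Materializing the query with ToList() is a small safety measure: a LINQ query is lazy, so enumerating it a second time would start every scrape again.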

Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates), then you can simply call the Wait method on the Task returned by WhenAll:

// This will block, waiting for all tasks to complete, all
// tasks will run asynchronously and when all are done, then the
// code will continue to execute.
Task.WhenAll(tasks).Wait();

// Process allData.

The point is that you want to collect your Task instances into a sequence and then wait on the entire sequence before you process allData.

However, I'd suggest trying to process the data before merging it into allData if you can. Unless the processing requires the entire DataSet, you'll get even more performance gains by processing as much of the data as you can as soon as each page comes back, as opposed to waiting for all of it to come back.
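One way to process each page as soon as it arrives is to move the per-page work into its own async method and await the scrape there. The sketch below assumes the scraper is passed in as a delegate standing in for ScrapeDataAsync, and that a hypothetical per-page process step exists which does not need the full data set; the class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading.Tasks;

static class ProcessAsYouGo
{
    // Awaits one page, processes it immediately, then merges it.
    // 'scrape' stands in for ScrapeDataAsync; 'process' is a
    // hypothetical per-page step (cleaning, validation, and so on).
    static async Task ScrapeProcessMergeAsync(
        Uri url,
        Func<Uri, Task<DataSet>> scrape,
        Action<DataSet> process,
        DataSet allData)
    {
        DataSet page = await scrape(url);

        // Runs as soon as this page is back, while other pages
        // may still be downloading.
        process(page);

        // Different pages can finish on different threads,
        // so guard the shared data set.
        lock (allData)
        {
            allData.Merge(page);
        }
    }

    public static async Task<DataSet> RunAsync(
        IEnumerable<Uri> urls,
        Func<Uri, Task<DataSet>> scrape,
        Action<DataSet> process)
    {
        var allData = new DataSet();

        // Start every page without awaiting, then wait for all of them.
        List<Task> tasks = urls
            .Select(url => ScrapeProcessMergeAsync(url, scrape, process, allData))
            .ToList();

        await Task.WhenAll(tasks);
        return allData;
    }
}
```

Compared with ContinueWith, awaiting inside a helper method keeps the per-page steps in reading order and lets an exception from the scrape surface through the task that WhenAll is waiting on.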
