TPL Dataflow vs plain Semaphore

Problem description

I have a requirement to build a scalable process. The process is mainly I/O with some minor CPU work (mainly deserializing strings). It queries a database for a list of URLs, fetches data from those URLs, deserializes the downloaded data into objects, then persists some of the data into Dynamics CRM and also into another database. Afterwards I need to update the first database to record which URLs were processed. Part of the requirement is to make the degree of parallelism configurable.

Initially I thought of implementing it as a sequence of tasks with await, limiting the parallelism with a Semaphore, which is quite simple. Then I read a few posts and answers here by @Stephen Cleary that recommend TPL Dataflow, and I thought it could be a good candidate. However, I want to make sure that I'm "complicating" the code with Dataflow for a worthy cause. I also got a suggestion to use a ForEachAsync extension method, which is also simple to use; however, I'm not sure whether it would cause memory overhead because of the way it partitions the collection.

Is TPL Dataflow a good option for such a scenario? How is it better than a Semaphore or the ForEachAsync method? What benefits would I actually gain by implementing it with TPL Dataflow rather than either of the other options (Semaphore/ForEachAsync)?

Recommended answer

"The process has mainly IO operations with some minor CPU operations (mainly deserializing strings)."

That's pretty much just I/O. Unless those strings are huge, the deserialization won't be worth parallelizing. The kind of CPU work you're doing will be lost in the noise.

So, you'll want to focus on concurrent asynchrony.

  • SemaphoreSlim is the standard pattern for this, as you've found.
  • TPL Dataflow can also do concurrency (both asynchronous and parallel forms).

ForEachAsync can take several forms; note that in the blog post you referenced, there are 5 different implementations of this method, each of which is valid. "[T]here are many different semantics possible for iteration, and each will result in different design choices and implementations." For your purposes (not wanting CPU parallelization), you shouldn't consider the ones that use Task.Run or partitioning. In an asynchronous concurrency world, any ForEachAsync implementation is just going to be syntactic sugar that hides which semantics it implements, which is why I tend to avoid it.

This leaves you with SemaphoreSlim vs. ActionBlock. I generally recommend people start with SemaphoreSlim first, and consider moving to TPL Dataflow if their needs become more complex (in a way that seems like they would benefit from a dataflow pipeline).
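For comparison, here is a minimal sketch of what the ActionBlock side of that choice could look like for the simple case. The class name ActionBlockProcessor and the delegate processUrlAsync are hypothetical; processUrlAsync stands in for the whole per-URL operation described in the question. Depending on the target framework, System.Threading.Tasks.Dataflow may need to be added as a NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ActionBlockProcessor
{
    // processUrlAsync is a hypothetical stand-in for the whole per-URL
    // operation (fetch, deserialize, persist, update); it is not from the question.
    public static async Task ProcessAllAsync(
        IEnumerable<string> urls,
        int maxConcurrency,
        Func<string, Task> processUrlAsync)
    {
        var block = new ActionBlock<string>(
            processUrlAsync,
            new ExecutionDataflowBlockOptions
            {
                // The configurable "knob": at most this many invocations
                // of processUrlAsync are in flight at once.
                MaxDegreeOfParallelism = maxConcurrency
            });

        foreach (var url in urls)
        {
            // SendAsync also works if a BoundedCapacity is set later.
            await block.SendAsync(url);
        }

        // Signal that no more items are coming, then wait for the block to drain.
        block.Complete();
        await block.Completion;
    }
}
```

With a single block like this, the behavior is close to the SemaphoreSlim approach; the difference only starts to matter once several blocks are linked together.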

E.g., "Part of the requirement is to make the parallelism degree configurable."

You may start off by allowing a single degree of concurrency, where the thing being throttled is one whole operation (fetch data from the URL, deserialize the downloaded data to objects, persist into Dynamics CRM and to another database, and update the first database). This is where SemaphoreSlim would be a perfect solution.
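A minimal sketch of that single-knob approach is below. The names ThrottledProcessor and processUrlAsync are hypothetical; processUrlAsync is assumed to wrap the whole fetch/deserialize/persist/update sequence for one URL, and maxConcurrency is the configurable degree of concurrency.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledProcessor
{
    // processUrlAsync is a hypothetical delegate for the whole per-URL
    // operation; it is an assumption, not code from the question.
    public static async Task ProcessAllAsync(
        IEnumerable<string> urls,
        int maxConcurrency,
        Func<string, Task> processUrlAsync)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            // One task is created per URL up front, but each one waits for
            // a free slot before it starts doing real work.
            var tasks = urls.Select(async url =>
            {
                await throttle.WaitAsync();
                try
                {
                    await processUrlAsync(url);
                }
                finally
                {
                    // Always free the slot, even if the operation throws.
                    throttle.Release();
                }
            }).ToList();

            // Completes when every URL has been processed; rethrows if any failed.
            await Task.WhenAll(tasks);
        }
    }
}
```

Note that this starts one (mostly idle) task per URL, so for a very large URL list it may be worth feeding the URLs in batches.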

But you may decide you want to have multiple knobs: say, one degree of concurrency for how many URLs you're downloading, a separate degree of concurrency for persisting, and a separate degree of concurrency for updating the original database. And then you'd also need to limit the "queues" in between these points (only so many deserialized objects in memory, etc.) to ensure that fast URLs with slow databases don't cause problems with your app using too much memory. If these are useful semantics, then you have started approaching the problem from a dataflow perspective, and that's the point at which you may be better served by a library like TPL Dataflow.
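As a rough illustration of the multi-knob case, the sketch below links three blocks, each with its own MaxDegreeOfParallelism and a BoundedCapacity that caps the queue in front of it. Everything named here is an assumption made for the example: PipelineSettings, UrlPipeline, the default values, and the three delegates (in particular, persistAsync is assumed to return the URL that should be marked as processed).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical settings type: one knob per stage plus a shared queue limit.
public sealed class PipelineSettings
{
    public int DownloadConcurrency { get; set; } = 8;
    public int PersistConcurrency { get; set; } = 4;
    public int UpdateConcurrency { get; set; } = 2;
    public int QueueLimit { get; set; } = 100;
}

public static class UrlPipeline
{
    // All three delegates are hypothetical stand-ins for the real I/O work.
    public static async Task RunAsync<TItem>(
        IEnumerable<string> urls,
        PipelineSettings settings,
        Func<string, Task<TItem>> downloadAndDeserializeAsync,
        Func<TItem, Task<string>> persistAsync,
        Func<string, Task> markProcessedAsync)
    {
        // Each stage gets its own concurrency; BoundedCapacity limits how many
        // items a block will hold, which is what keeps memory use in check.
        ExecutionDataflowBlockOptions Options(int concurrency) =>
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = concurrency,
                BoundedCapacity = settings.QueueLimit
            };

        var download = new TransformBlock<string, TItem>(
            downloadAndDeserializeAsync, Options(settings.DownloadConcurrency));
        var persist = new TransformBlock<TItem, string>(
            persistAsync, Options(settings.PersistConcurrency));
        var update = new ActionBlock<string>(
            markProcessedAsync, Options(settings.UpdateConcurrency));

        // Completion flows downstream, so awaiting the last block is enough.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(persist, link);
        persist.LinkTo(update, link);

        foreach (var url in urls)
        {
            // SendAsync waits when the first block is full (backpressure).
            await download.SendAsync(url);
        }

        download.Complete();
        await update.Completion;
    }
}
```

Error handling is deliberately left out of this sketch. A fault in a downstream block can leave upstream blocks stalled once their bounded queues fill up, so real code would need to catch exceptions inside the delegates or propagate faults back upstream.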
