How to PLINQ an existing LINQ query with Joins?


Question

I'm using LINQ to compare two DataSets with each other in order to create new rows and update existing ones. I've noticed that the complete comparison takes ~1.5 hours and that only one of the two cores is busy (Task Manager shows 50-52% CPU usage). I must admit that I'm completely new to parallel LINQ, but I assume it could increase performance significantly.

So my question is, how and what should I parallelize?

These are the original queries (reduced to the essentials):

'check for new data
Dim srcUnique = From row In src.Email_Total
                Select Ticket_ID = row.ticket_id, Interaction = row.interaction, ModifiedAt = row.modified_time

Dim destUnique = From row In dest.ContactDetail
                 Where row.ContactRow.fiContactType = emailContactType.idContactType
                 Select row.ContactRow.Ticket_ID, row.Interaction, row.ModifiedAt

'get all emails(contactdetails) that are in source but not in destination
Dim diffRows = srcUnique.Except(destUnique).ToList

'get all new emails(according to ticket_id) for calculating contact columns
Dim newRowsTickets = (From row In src.Email_Total
                     Join d In diffRows
                     On row.ticket_id Equals d.Ticket_ID _
                     And row.interaction Equals d.Interaction _
                     And row.modified_time Equals d.ModifiedAt
                     Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

For Each ticket In newRowsTickets
     Dim contact = dest.Contact.FindByTicket_IDfiContactType(ticket.Ticket_ID, emailContactType.idContactType)
     If contact Is Nothing Then
          ' Create new Contact with many sub-queries on this ticket(omitted) ****'
          Dim newContact = Me.dest.Contact.NewContactRow
          dest.Contact.AddContactRow(newContact)
          contact = newContact
     Else
          ' Update Contact with many sub-queries on this ticket(omitted) '
     End If
     daContact.Update(dest.Contact)

     ' Add new ContactDetail-Rows from this Ticket(this is the counterpart of the src.Email_Total-Rows, details omitted) '
     For Each newRow In ticket.NewTicketRows
         Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
         newContactDetail.ContactRow = contact
         dest.ContactDetail.AddContactDetailRow(newContactDetail)
     Next
     daContactDetails.Update(dest.ContactDetail)
Next

Note: daContact and daContactDetails are SqlDataAdapters, src and dest are DataSets, and Contact and ContactDetail are DataTables, where every ContactDetail belongs to a Contact.

Even if both cores would not reach 100% CPU, I assume that parallelizing the queries would increase performance significantly, because the second core is nearly idle. The For Each loop might also be a good place to optimize, since the tickets are not related to each other. So I assume that I could loop with multiple threads and create/update records in parallel. But how do I do that with PLINQ?

Side note: As I've mentioned in the comments, performance has not been a key factor for me so far, since the server's only purpose is to synchronize the MySQL database (on another server) with an MS SQL Server (on the same server as this Windows service). It acts as a source for reports that are generated by another service, and these reports are only generated once a day. Apart from that, I was interested in learning PLINQ because I thought it could be an excellent exercise. The run takes the mentioned 1.5 hours only if the destination DB is empty and all records must be created. If both databases are nearly in sync, this method takes only ~1 minute. In future, performance will become more important, since email is only one of several contact types (chat + calls will exceed 1 million records). I think I'll need some kind of (LINQ) data paging then anyway.

If something is unclear, I'll update my question accordingly. Thanks in advance.

Edit: Here is the result of my investigations and attempts:

Question: How to "PLINQ" an existing LINQ query with joins?

Answer: Note that some LINQ operators are binary: they take two IEnumerables as input. Join is a perfect example of such an operator. In these cases, the type of the left-most data source determines whether LINQ or PLINQ is used. Thus you need only call AsParallel on the first data source for your query to run in parallel:

IEnumerable<T> leftData = ..., rightData = ...;
var q = from x in leftData.AsParallel()
        join y in rightData on x.a equals y.b
        select f(x, y);

But if I change my query as follows (note the AsParallel):

Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
                      Join d In diffRows
                      On row.ticket_id Equals d.Ticket_ID _
                      And row.interaction Equals d.Interaction _
                      And row.modified_time Equals d.ModifiedAt
                      Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

With this change, the compiler complains that I need to add AsParallel to the right data source as well. So this seems to be either a VB.NET issue or outdated documentation (the article is from 2007). I assume the latter, because the (otherwise recommendable) article also says that you need to add System.Concurrency.dll manually, whereas it is actually part of the .NET 4.0 Framework (PLINQ lives in the System.Linq namespace, the Task Parallel Library in System.Threading.Tasks).

I realized that I won't profit from a parallelized Except, since that query is fast enough in sequential mode (even with nearly the same number of rows in both collections, which results in the maximum number of comparisons, I got the result in less than 30 seconds). But I will add it later for the sake of completeness.

So I decided to parallelize the For Each loop, which is as easy as with LINQ queries: you simply add AsParallel() at the end. But I realized that I need to force the parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism), otherwise .NET decides to use only one core for this loop. I also wanted to tell .NET to use as many threads as possible, but not more than 8: WithDegreeOfParallelism(8).

Now both cores are working at the same time, but the CPU usage stays at 54%.

So this is the PLINQ version so far:

Dim diffRows = srcUnique.AsParallel.Except(destUnique.AsParallel).ToList

Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
                        Join d In diffRows.AsParallel()
                        On row.ticket_id Equals d.Ticket_ID _
                        And row.interaction Equals d.Interaction _
                        And row.modified_time Equals d.ModifiedAt
                    Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

For Each ticket In newRowsTickets.
                    AsParallel().
                      WithDegreeOfParallelism(8).
                       WithExecutionMode(ParallelExecutionMode.ForceParallelism)
    '  blah,blah ...  '

    'add new ContactDetails for this Ticket(only new rows)
    For Each newRow In ticket.NewTicketRows.
                                AsParallel().
                                    WithExecutionMode(ParallelExecutionMode.Default)
        ' blah,blah ... '
    Next
    daContactDetails.Update(dest.ContactDetail)
Next

Unfortunately, I don't see any performance benefit from using AsParallel compared with sequential mode:

The For Each with AsParallel (hh:mm:ss.ff):

09/29/2011 18:54:36: Contacts/ContactDetails created or modified. Duration: 01:21:34.40

And without:

09/29/2011 16:02:55: Contacts/ContactDetails created or modified. Duration: 01:21:24.50

Can somebody explain this result to me? Is the database write access in the For Each loop responsible for the similar timings?

Recommended reading:

  • http://msdn.microsoft.com/en-us/magazine/cc163329.aspx (not up to date)
  • List of changes since above article

Recommended answer

There are 3 points worth investigating further:

1. Do not use .ToList(). I might be wrong, but I think using .ToList this way forces the query to execute immediately and rules out any further optimization that might otherwise be possible.
2. Use your own filtering operation to compare the data from both destinations. It might give you better performance.
3. See if you could use LinqDataView (http://blogs.msdn.com/b/erickt/archive/2008/05/19/linq-to-dataset-linqdataview-and-indexes.aspx) to get better performance.

I don't think you will gain an advantage from PLINQ while doing the insertions. Look at this answer for more details: http://stackoverflow.com/questions/3290353/is-it-ok-to-use-plinq-forall-for-a-bulk-insert-into-database/3290406#3290406
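
A related observation that may help explain near-identical timings like those in the question, offered as a sketch rather than a diagnosis of the original code: enumerating a PLINQ query with For Each merges the results back onto the enumerating thread, so the loop body still runs sequentially. ParallelEnumerable.ForAll (or Parallel.ForEach) is what runs the body on worker threads. The module name and the integer "tickets" below are hypothetical stand-ins, and no database is involved.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Collections.Generic
Imports System.Linq
Imports System.Threading

' Hypothetical demo module; "tickets" stands in for newRowsTickets.
Module ForEachVersusForAll
    Sub Main()
        Dim tickets = Enumerable.Range(1, 1000).ToList()

        ' For Each over a ParallelQuery: PLINQ may produce the items in
        ' parallel, but the loop body runs on the thread that enumerates.
        Dim loopThreads As New HashSet(Of Integer)
        For Each ticket In tickets.AsParallel().
                               WithExecutionMode(ParallelExecutionMode.ForceParallelism)
            loopThreads.Add(Thread.CurrentThread.ManagedThreadId)
        Next
        ' The body only ever sees the enumerating thread, so this is 1.
        Console.WriteLine("For Each body threads: {0}", loopThreads.Count)

        ' ForAll: the body itself is executed by the PLINQ worker threads.
        Dim workerThreads As New ConcurrentDictionary(Of Integer, Boolean)
        tickets.AsParallel().
                WithDegreeOfParallelism(8).
                WithExecutionMode(ParallelExecutionMode.ForceParallelism).
                ForAll(Sub(ticket)
                           workerThreads.TryAdd(Thread.CurrentThread.ManagedThreadId, True)
                       End Sub)
        Console.WriteLine("ForAll body threads: {0}", workerThreads.Count)
    End Sub
End Module
```

Moving the real loop body into ForAll is not a drop-in change: DataSet and DataTable are only safe for multithreaded reads, so adding rows and calling the adapters' Update from several threads would need synchronization, and the database round trips may remain the bottleneck either way.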

Hope that helps. Please do ask if you need clarification on any of the above points.
