PLINQ on large lists taking an enormous amount of time


Question

I have two in-memory lists, plays and consumers, one with 15 million objects and the other with around 3 million.

The following are a few of the queries I'm running:

consumersn=consumers.AsParallel()
                    .Where(w => plays.Any(x => x.consumerid == w.consumerid))
                    .ToList();


List<string> consumerids = plays.AsParallel()
                                .Where(w => w.playyear == group_period.year 
                                         && w.playmonth == group_period.month 
                                         && w.sixteentile == group_period.group)
                                .Select(c => c.consumerid)
                                .ToList();


int groupcount = plays.AsParallel()
                      .Where(w => w.playyear == period.playyear 
                               && w.playmonth == period.playmonth 
                               && w.sixteentile == group 
                               && consumerids.Any(x => x == w.consumerid))
                      .Count();

I'm using a 16-core machine with 32 GB of RAM. In spite of this, the first query took around 20 hours to run.

What am I doing wrong?

Any help is appreciated.

Thanks

Recommended answer

The first LINQ query is very inefficient; parallelization can only help you so much.

Explanation: When you write consumers.Where(w => plays.Any(x => x.consumerid == w.consumerid)), then for every object in consumers you potentially iterate over the whole plays list looking for a match. In the worst case that is 3 million consumers times 15 million plays = 45 trillion comparisons. Even spread across 16 cores, that is about 2.8 trillion comparisons per core.
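The cost described above can be made visible on a toy input. The following is only a sketch: the lists hold made-up ids (plain strings rather than the real play and consumer objects), and a counter is added to the predicate purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical miniature data: 3 consumer ids, 5 play ids, no overlap.
var consumers = new List<string> { "a", "b", "c" };
var plays = new List<string> { "x", "y", "z", "x", "y" };

long comparisons = 0;

// Same shape as the first query: when a consumer has no play,
// Any() scans the entire plays list before giving up.
var matched = consumers
    .Where(c => plays.Any(p => { comparisons++; return p == c; }))
    .ToList();

Console.WriteLine(comparisons);                    // 15 = 3 consumers x 5 plays
Console.WriteLine(3_000_000L * 15_000_000L);       // worst case for the real lists
Console.WriteLine(3_000_000L * 15_000_000L / 16);  // share per core on 16 cores
```

Any() stops at the first match, so the real count depends on how many consumers have plays and where those plays sit in the list; the product is an upper bound.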

So, the first step here would be to group all plays by their consumer ids, and to cache the result in an appropriate data structure:

var playsByConsumerIds = plays.ToLookup(x => x.consumerid, StringComparer.Ordinal);

Then, your first query becomes:

consumersn = consumers.Where(w => playsByConsumerIds.Contains(w.consumerid)).ToList();

This query should be much faster, even without any parallelization: building the lookup is a single pass over plays, and each Contains call is a hash lookup instead of a scan.
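For reference, here is a self-contained sketch of the lookup approach on toy data. The tuple shapes and the name field are assumptions made for the example, since the real classes are not shown in the question. The HashSet variant at the end is an alternative not mentioned above; it may be a lighter fit when only membership matters, because it stores each id once rather than every play.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the real consumer and play objects.
var consumers = new List<(string consumerid, string name)>
{
    ("c1", "Ann"), ("c2", "Bob"), ("c3", "Cid"),
};
var plays = new List<(string consumerid, int playyear)>
{
    ("c1", 2020), ("c3", 2020), ("c3", 2021),
};

// One pass over plays builds a hash-based index keyed by consumer id.
var playsByConsumerIds = plays.ToLookup(x => x.consumerid, StringComparer.Ordinal);

// Each Contains call is a hash lookup rather than a scan of plays.
var consumersn = consumers
    .Where(w => playsByConsumerIds.Contains(w.consumerid))
    .ToList();

// Alternative (not from the answer): keep only the distinct ids.
var playedIds = new HashSet<string>(
    plays.Select(p => p.consumerid), StringComparer.Ordinal);
var viaSet = consumers.Where(w => playedIds.Contains(w.consumerid)).ToList();

Console.WriteLine(string.Join(",", consumersn.Select(c => c.consumerid))); // c1,c3
Console.WriteLine(viaSet.Count);                                           // 2
```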

I cannot fix the other queries because I can't see exactly what you are doing with group_period, but I would suggest using GroupBy or ToLookup to create all groups in a single pass.
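As a rough illustration of that suggestion, the sketch below indexes plays once by (playyear, playmonth, sixteentile) and answers both remaining queries from the index. It rests on assumptions: the tuple shape stands in for the real play class, the period values are invented, and the surrounding loop over group_period is not reproduced because it is not shown in the question. Note that consumerids.Any(...) inside the third query's Where is the same scan-per-element pattern as in the first query, so the ids go into a HashSet here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real play objects.
var plays = new List<(string consumerid, int playyear, int playmonth, int sixteentile)>
{
    ("c1", 2020, 1, 4),
    ("c2", 2020, 1, 4),
    ("c1", 2020, 2, 4),
    ("c3", 2020, 2, 4),
    ("c1", 2020, 2, 7),
};

// Single pass: index every play by its (year, month, sixteentile) group.
var playsByGroup = plays.ToLookup(p => (p.playyear, p.playmonth, p.sixteentile));

// Second query: consumer ids in one group. Indexing a lookup with a
// missing key yields an empty sequence, so no existence check is needed.
var consumerids = new HashSet<string>(
    playsByGroup[(2020, 1, 4)].Select(p => p.consumerid),
    StringComparer.Ordinal);

// Third query: plays in another period's group whose consumer is in the set.
int groupcount = playsByGroup[(2020, 2, 4)]
    .Count(p => consumerids.Contains(p.consumerid));

Console.WriteLine(groupcount); // 1 (only c1 appears in both groups)
```

With the index built once, each additional period or group costs only the size of that group rather than another pass over all 15 million plays.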
