How to check an IEnumerable for multiple conditions with a single enumeration without buffering?

Problem description

I have a very long sequence of data in the form of an IEnumerable, and I would like to check it against a number of conditions. Each condition returns true or false, and I want to know whether all conditions are true. My problem is that I cannot afford to materialize the IEnumerable by calling ToList, because it is simply too long (> 10,000,000,000 elements). Nor can I afford to enumerate the sequence multiple times, once for each condition, because each time I will get a different sequence. I am searching for an efficient way to perform this check, using the existing LINQ functionality if possible.

Clarification: I am asking for a general solution, not for a solution to the specific example problem presented below.

Here is a dummy version of my sequence:

static IEnumerable<int> GetLongSequence()
{
    var random = new Random();
    for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}

And here is an example of the conditions that the sequence must satisfy:

var source = GetLongSequence();
var result = source.Any(n => n % 28_413_803 == 0)
    && source.All(n => n < 99_999_999)
    && source.Average(n => n) > 50_000_001;

Unfortunately this approach invokes GetLongSequence three times, so it doesn't satisfy the requirements of the problem.
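The repeated invocation can be observed directly. The following is a minimal sketch, assuming a hypothetical five-element stand-in named GetShortSequence and a hypothetical counter named Starts (neither is part of the original question). Because an iterator method runs its body afresh for every enumerator, each LINQ operator that enumerates the source restarts the generator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class EnumerationCounter
{
    // Hypothetical counter: how many times the iterator body began executing.
    public static int Starts;

    // Hypothetical small stand-in for GetLongSequence.
    public static IEnumerable<int> GetShortSequence()
    {
        Starts++; // runs on the first MoveNext of every new enumerator
        for (int i = 1; i <= 5; i++) yield return i;
    }

    public static int CountStarts()
    {
        Starts = 0;
        var source = GetShortSequence(); // deferred: nothing has run yet
        bool result = source.Any(n => n == 3)   // first enumeration
            && source.All(n => n < 10)          // second enumeration
            && source.Average(n => n) > 2;      // third enumeration
        return result ? Starts : -1;            // all three conditions hold for this data
    }

    static void Main()
    {
        Console.WriteLine(CountStarts()); // prints 3
    }
}
```

Note that `&&` short-circuits, so fewer than three enumerations happen when an earlier condition is false; the count of three applies when every condition has to be evaluated.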

I tried to write a LINQ-style extension method for the above, hoping that this could give me some ideas:

public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
    params Func<IEnumerable<TSource>, bool>[] conditions)
{
    foreach (var condition in conditions)
    {
        if (!condition(source)) return false;
    }
    return true;
}

This is how I intend to use it:

var result = source.AllConditions
(
    s => s.Any(n => n % 28_413_803 == 0),
    s => s.All(n => n < 99_999_999),
    s => s.Average(n => n) > 50_000_001
    // more conditions...
);

Unfortunately this offers no improvement. GetLongSequence is again invoked three times.

After banging my head against the wall for an hour without making any progress, I figured out a possible solution. I could run each condition in a separate thread, and synchronize their access to a single shared enumerator of the sequence. So I ended up with this monstrosity:

// Requires: using System.Collections.Concurrent; using System.Linq; using System.Threading;
public static bool AllConditions<TSource>(this IEnumerable<TSource> source,
    params Func<IEnumerable<TSource>, bool>[] conditions)
{
    var locker = new object();
    var enumerator = source.GetEnumerator();
    var barrier = new Barrier(conditions.Length);
    long index = -1;
    bool finished = false;

    IEnumerable<TSource> OneByOne()
    {
        try
        {
            while (true)
            {
                TSource current;
                lock (locker)
                {
                    if (finished) break;
                    if (barrier.CurrentPhaseNumber > index)
                    {
                        index = barrier.CurrentPhaseNumber;
                        finished = !enumerator.MoveNext();
                        if (finished)
                        {
                            enumerator.Dispose(); break;
                        }
                    }
                    current = enumerator.Current;
                }
                yield return current;
                barrier.SignalAndWait();
            }
        }
        finally
        {
            barrier.RemoveParticipant();
        }
    }

    var results = new ConcurrentQueue<bool>();
    var threads = conditions.Select(condition => new Thread(() =>
    {
        var result = condition(OneByOne());
        results.Enqueue(result);
    })
    { IsBackground = true }).ToArray();
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
    return results.All(r => r);
}

For the synchronization I used a Barrier. This solution actually works way better than I expected. It can process almost 1,000,000 elements per second on my machine. It is not fast enough though, since it needs almost 3 hours to process the full sequence of 10,000,000,000 elements, and I can't wait longer than 5 minutes for the result. Any ideas about how I could run these conditions efficiently in a single thread?
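To put those figures in perspective, here is a small arithmetic sketch. The helper names SecondsNeeded and RequiredRate are hypothetical; the inputs are the numbers quoted above.

```csharp
using System;

static class Throughput
{
    // Seconds needed to process `elements` items at `perSecond` items per second.
    public static double SecondsNeeded(double elements, double perSecond) => elements / perSecond;

    // Items per second required to finish `elements` items within `seconds`.
    public static double RequiredRate(double elements, double seconds) => elements / seconds;

    static void Main()
    {
        double elements = 10_000_000_000;

        // 10,000 seconds at 1,000,000 items per second, i.e. roughly 2.78 hours.
        Console.WriteLine(SecondsNeeded(elements, 1_000_000) / 3600);

        // Roughly 33.3 million items per second to finish within 5 minutes.
        Console.WriteLine(RequiredRate(elements, 5 * 60));
    }
}
```

In other words, meeting the 5-minute budget would take a speedup of about 33 times over the threaded version.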

Recommended answer

If you need to ensure that the sequence is enumerated only once, conditions that operate on the whole sequence are not useful. One possibility that comes to mind is to define an interface that is called for each element of the sequence, and to implement it differently for each specific condition:

bool Example()
{
    var source = GetLongSequence();

    var conditions = new List<IEvaluate<int>>
    {
        new Any<int>(n => n % 28_413_803 == 0),
        new All<int>(n => n < 99_999_999),
        new Average(d => d > 50_000_001)
    };

    foreach (var item in source)
    {
        foreach (var condition in conditions)
        {
            condition.Evaluate(item);
        }
    }

    return conditions.All(c => c.Result);   
}

static IEnumerable<int> GetLongSequence()
{
    var random = new Random();
    for (long i = 0; i < 10_000_000_000; i++) yield return random.Next(0, 100_000_000);
}

interface IEvaluate<T>
{
    void Evaluate(T item);
    bool Result { get; }
}

class Any<T> : IEvaluate<T>
{
    private bool _result;
    private readonly Func<T, bool> _predicate;

    public Any(Func<T, bool> predicate)
    {
        _predicate = predicate;
        _result = false;
    }

    public void Evaluate(T item)
    {
        if (_predicate(item))
        {
            _result = true;
        }
    }

    public bool Result => _result;
}


class All<T> : IEvaluate<T>
{
    private bool _result;
    private readonly Func<T, bool> _predicate;

    public All(Func<T, bool> predicate)
    {
        _predicate = predicate;
        _result = true;
    }

    public void Evaluate(T item)
    {
        if (!_predicate(item))
        {
            _result = false;
        }
    }

    public bool Result => _result;
}

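class Average : IEvaluate<int>
{
    private long _sum;
    private long _count; // long: the sequence can exceed int.MaxValue elements
    private readonly Func<double, bool> _evaluate;

    public Average(Func<double, bool> evaluate)
    {
        _evaluate = evaluate; // store the predicate; otherwise Result throws NullReferenceException
    }

    public void Evaluate(int item)
    {
        _sum += item;
        _count++;
    }

    public bool Result => _evaluate((double)_sum / _count);
}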
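Since the question asks for existing LINQ functionality where possible, the same single-pass idea can also be sketched with Enumerable.Aggregate and a tuple accumulator. This is only an illustration for the three example conditions, not a general solution; the names SinglePassCheck, CheckExampleConditions, AnyHit and AllBelow are hypothetical, and the empty-sequence guard is an assumption about the desired behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SinglePassCheck
{
    // Evaluates the three example conditions in one enumeration of `source`.
    public static bool CheckExampleConditions(IEnumerable<int> source)
    {
        var acc = source.Aggregate(
            (AnyHit: false, AllBelow: true, Sum: 0L, Count: 0L),
            (a, n) => (
                AnyHit: a.AnyHit || n % 28_413_803 == 0,
                AllBelow: a.AllBelow && n < 99_999_999,
                Sum: a.Sum + n,
                Count: a.Count + 1));

        // Assumption: an empty sequence counts as failing
        // (Enumerable.Average would throw on it instead).
        return acc.Count > 0
            && acc.AnyHit
            && acc.AllBelow
            && (double)acc.Sum / acc.Count > 50_000_001;
    }

    static void Main()
    {
        var sample = new[] { 28_413_803, 99_999_998, 99_999_998 };
        Console.WriteLine(CheckExampleConditions(sample)); // prints True
    }
}
```

Like the evaluator classes, this never stops early: Aggregate always consumes the whole sequence, even after the All-style condition has already failed. For a sequence of this size, a plain foreach loop with local variables may well be faster than invoking a delegate per element, so it is worth measuring both.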
