LINQ-to-objects index within a group + for different groupings (aka ROW_NUMBER with PARTITION BY equivalent)


Question


After much Google searching and code experimentation, I'm stumped on a complex C# LINQ-to-objects problem which in SQL would be easy to solve with a pair of ROW_NUMBER()...PARTITION BY functions and a subquery or two.

Here, in words, is what I'm trying to do in code. The underlying requirement is to remove duplicate documents from a list:

  1. First, group a list by (Document.Title, Document.SourceId), assuming a (simplified) class definition like this:

    class Document
    {
        string Title;
        int SourceId; // sources are prioritized (ID=1 better than ID=2)
    }

  2. Within that group, assign each document an index (e.g. Index 0 == 1st document with this title from this source, Index 1 = 2nd document with this title from this source, etc.). I'd love the equivalent of ROW_NUMBER() in SQL!

  3. Now group by (Document.Title, Index), where Index was computed in Step #2. For each group, return only one document: the one with the lowest Document.SourceId.

Step #1 is easy (e.g. codepronet.blogspot.com/2009/01/group-by-in-linq.html), but I'm getting stumped on steps #2 and #3. I can't seem to build a red-squiggle-free C# LINQ query to solve all three steps.

Anders Hejlsberg's post on this thread is, I think, the answer to steps #2 and #3 above, if I could get the syntax right.

I'd prefer to avoid using an external local variable to do the Index computation, as recommended on slodge.blogspot.com/2009/01/adding-row-number-using-linq-to-objects.html, since that solution breaks if the external variable is modified.

Ideally, the group-by-Title step could be done first, so the "inner" groupings (first by Source to compute the index, then by Index to filter out duplicates) can operate on small numbers of objects in each "by title" group, since the number of documents in each by-title group is usually under 100. I really don't want an O(N²) solution!

I could certainly solve this with nested foreach loops, but it seems like the kind of problem which should be simple with LINQ.

Any ideas?

Answer

I think jpbochi missed that you want your groupings to be by pairs of values (Title+SourceId then Title+Index). Here's a LINQ query (mostly) solution:

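var selectedFew = 
    from doc in docs
    // Step 1: partition by (Title, SourceId)
    group doc by new { doc.Title, doc.SourceId } into sourceGroup
    // Step 2: attach each document's 0-based index within its partition
    from docIndex in sourceGroup.Select((d, i) => new { Doc = d, Index = i })
    // Step 3: regroup by (Title, Index), then keep the lowest SourceId
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into indexGroup
    select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);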

First we group by Title+SourceId (I use an anonymous type because the compiler builds a good hashcode for the grouping lookup). Then we use the indexed overload of Select to attach each document's index within its group, which we use as part of the key in our second grouping. Finally, for each group we pick the document with the lowest SourceId.
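For readers who prefer method syntax, the same three steps can be sketched as a chain of `GroupBy`, `SelectMany`, and the indexed `Select`. This is only an illustration: the `Document` class here adds public properties to the simplified class from the question, `DocumentDeduper.Deduplicate` is a hypothetical name, and `OrderBy(...).First()` stands in for the `Aggregate` call (it should pick the same element, since `OrderBy` is a stable sort).

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified document type from the question, with public properties added (assumption).
public class Document
{
    public string Title { get; set; }
    public int SourceId { get; set; }
}

public static class DocumentDeduper
{
    // Hypothetical helper name; mirrors the query expression step by step.
    public static List<Document> Deduplicate(IEnumerable<Document> docs)
    {
        return docs
            // Step 1: partition by (Title, SourceId).
            .GroupBy(d => new { d.Title, d.SourceId })
            // Step 2: number the documents inside each partition (0-based).
            .SelectMany(g => g.Select((d, i) => new { Doc = d, Index = i }))
            // Step 3: regroup by (Title, Index) and keep the lowest SourceId.
            .GroupBy(x => new { x.Doc.Title, x.Index })
            .Select(g => g.OrderBy(x => x.Doc.SourceId).First().Doc)
            .ToList();
    }
}
```

Unlike the query above, this sketch returns the documents themselves rather than the `{ Doc, Index }` pairs; drop the trailing `.Doc` if the index is still needed.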

Given this input:

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

I get this output:

{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 1 }
{ Doc = { Title = 123, SourceId = 7 }, Index = 2 }

Update: I just saw your question about grouping by Title first. You can do this using a subquery on your Title groups:

var selectedFew =
    from doc in docs
    group doc by doc.Title into titleGroup
    from docWithIndex in
        (
            from doc in titleGroup
            group doc by doc.SourceId into idGroup
            from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
            group docIndex by docIndex.Index into indexGroup
            select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
        )
    select docWithIndex;
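If the "index within a group" step comes up often, it could be factored into a small extension method. The sketch below is an assumption on my part, not something LINQ provides: `WithRowNumber` and `RowNumberExtensions` are hypothetical names, and the numbering is 0-based, whereas SQL's `ROW_NUMBER()` starts at 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowNumberExtensions
{
    // Hypothetical helper, not part of LINQ: pairs each element with its
    // 0-based position inside its partition, similar in spirit to
    // ROW_NUMBER() OVER (PARTITION BY key) in SQL (which is 1-based).
    // Elements come back grouped by partition, not in their original order.
    public static IEnumerable<(T Item, int RowNumber)> WithRowNumber<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> partitionBy)
    {
        return source
            .GroupBy(partitionBy)
            .SelectMany(g => g.Select((item, i) => (Item: item, RowNumber: i)));
    }
}
```

With this in place, step #2 would read something like `docs.WithRowNumber(d => new { d.Title, d.SourceId })`. No counter variable is captured from outside the query, so the numbering restarts cleanly on every enumeration.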
