确定数据表中的重复项 [英] Determining duplicates in a datatable

查看:131
本文介绍了确定数据表中的重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个从CSV文件加载的数据表。根据datatable中的两列( product_id owner_org_id ),我需要确定哪些行是重复的。一旦我确定了,我可以使用该信息来构建我的结果,这是一个仅包含不唯一的行的数据表,以及仅包含唯一的行的数据表。



我已经看过这里的其他例子,我到目前为止编写和执行的代码,但似乎认为数据中的每一行都是唯一的。实际上在测试数据中有13行,只有6是唯一的。很明显我在做错事。



编辑:我应该注意,有重复的行应该 ALL

EDIT2 :或者,如果我可以选择所有重复的行(而不是尝试选择唯一的行)对我来说很好。无论哪种方式都可以让我得到最终结果。



处理方法中的代码:

  MyRowComparer myrc = new MyRowComparer(); 
var uniquerows = dtCSV.AsEnumerable()。Distinct(myrc);

以及以下内容:

  public class MyRowComparer:IEqualityComparer< DataRow> 
{
public bool Equals(DataRow x,DataRow y)
{
// return((string.Compare(x.Field< string>(PRODUCT_ID),y .Field< string>(PRODUCT_ID),true))==
//(string.Compare(x.Field< string>(OWNER_ORG_ID),y.Field< string>(OWNER_ORG_ID ,true)));
return
x.ItemArray.Except(new object [] {x [x.Table.Columns [PRODUCT_ID]。ColumnName]})==
y.ItemArray.Except(new object [] {y [y.Table.Columns [PRODUCT_ID]。ColumnName]})&&
x.ItemArray.Except(new object [] {x [x.Table.Columns [OWNER_ORG_ID]。ColumnName]})==
y.ItemArray.Except(new object [] {y [y.Table.Columns [OWNER_ORG_ID]。ColumnName]});
}

public int GetHashCode(DataRow obj)
{
int y = int.Parse(obj.Field< string>(PRODUCT_ID));
int z = int.Parse(obj.Field< string>(OWNER_ORG_ID));
int c = y ^ z;
return c;
}
}


解决方案

你可以使用LINQ-To-DataSet和 Enumerable.Except / 相交

  var tbl1ID = tbl1.AsEnumerable()
.Select(r => new
{
product_id = r.Field< String> ;(product_id),
owner_org_id = r.Field< String>(owner_org_id),
});
var tbl2ID = tbl2.AsEnumerable()
.Select(r => new
{
product_id = r.Field< String>(product_id),
owner_org_id = r.Field< String>(owner_org_id),
});


var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);

var tblUnique =(from uniqueRow in unique
join row in tbl1.AsEnumerable()
on uniqueRow equals new
{
product_id = row.Field< ; String>(product_id),
owner_org_id = row.Field< String>(owner_org_id)
}
选择行).CopyToDataTable();
var tblBoth =(from bothRow in both
join row in tbl1.AsEnumerable()
on BothRow equals new
{
product_id = row.Field< String> product_id),
owner_org_id = row.Field< String>(owner_org_id)
}
选择行).CopyToDataTable();






修改:显然我误解了你的要求。所以你只有一个 DataTable ,并希望获得所有唯一的和所有重复的行,这更直接。您可以使用 Enumerable.GroupBy ,其中包含两个字段的匿名类型:

  var groups = tbl1.AsEnumerable()
.GroupBy(r => new
{
product_id = r.Field< String>(product_id),
owner_org_id = r.Field< String>(owner_org_id )
});
var tblUniques = groups
.Where(grp => grp.Count()== 1)
.Select(grp => grp.Single())
。 CopyToDataTable();
var tblDuplicates = groups
.Where(grp => grp.Count()> 1)
.SelectMany(grp => grp)
.CopyToDataTable();


I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id and owner_org_id) in the datatable. Once I've determined that, I can use that information to build my result, which is a datatable containing only the rows that are not unique, and a data table containing only the rows that are unique.

I've looked at other examples on here and the code I've come up with so far does compile and execute, but it seems to think every row in the data is unique. In reality in the test data there's 13 rows and only 6 are unique. So clearly I'm doing something wrong.

EDIT: Thought I should note, rows that have duplicates should ALL be removed, not just the duplicates of that row. eg if there are 4 duplicates, all 4 should be removed not 3, leaving one unique row from the 4.

EDIT2: Alternatively, if I can select all duplicate rows (instead of trying to select unique rows) it is fine with me. Either way can get me to my end result.

The code in the processing method:

MyRowComparer myrc = new MyRowComparer();
var uniquerows = dtCSV.AsEnumerable().Distinct(myrc);

along with the following:

public class MyRowComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        //return ((string.Compare(x.Field<string>("PRODUCT_ID"),   y.Field<string>("PRODUCT_ID"),   true)) ==
        //        (string.Compare(x.Field<string>("OWNER_ORG_ID"), y.Field<string>("OWNER_ORG_ID"), true)));
        return
            x.ItemArray.Except(new object[] { x[x.Table.Columns["PRODUCT_ID"].ColumnName] }) ==
            y.ItemArray.Except(new object[] { y[y.Table.Columns["PRODUCT_ID"].ColumnName] }) &&
            x.ItemArray.Except(new object[] { x[x.Table.Columns["OWNER_ORG_ID"].ColumnName] }) ==
            y.ItemArray.Except(new object[] { y[y.Table.Columns["OWNER_ORG_ID"].ColumnName] });
    }

    public int GetHashCode(DataRow obj)
    {
        int y = int.Parse(obj.Field<string>("PRODUCT_ID"));
        int z = int.Parse(obj.Field<string>("OWNER_ORG_ID"));
        int c = y ^ z;
        return c;
    }
}

解决方案

You could use LINQ-To-DataSet and Enumerable.Except/Intersect:

var tbl1ID = tbl1.AsEnumerable()
        .Select(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id"),
        });
var tbl2ID = tbl2.AsEnumerable()
        .Select(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id"),
        });


var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);

var tblUnique = (from uniqueRow in unique
                join row in tbl1.AsEnumerable()
                on uniqueRow equals new
                {
                    product_id = row.Field<String>("product_id"),
                    owner_org_id = row.Field<String>("owner_org_id")
                }
                select row).CopyToDataTable();
var tblBoth = (from bothRow in both
              join row in tbl1.AsEnumerable()
              on bothRow equals new
              {
                  product_id = row.Field<String>("product_id"),
                  owner_org_id = row.Field<String>("owner_org_id")
              }
              select row).CopyToDataTable();


Edit: Obviously i've misunderstood your requirement a little bit. So you only have one DataTable and want to get all unique and all duplicate rows, that's even more straight-forward. You can use Enumerable.GroupBy with an anonymous type containing both fields:

var groups = tbl1.AsEnumerable()
    .GroupBy(r => new
    {
        product_id = r.Field<String>("product_id"),
        owner_org_id = r.Field<String>("owner_org_id")
    });
var tblUniques = groups
    .Where(grp => grp.Count() == 1)
    .Select(grp => grp.Single())
    .CopyToDataTable();
var tblDuplicates = groups
    .Where(grp => grp.Count() > 1)
    .SelectMany(grp => grp)
    .CopyToDataTable();

这篇关于确定数据表中的重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆