确定数据表中的重复项 [英] Determining duplicates in a datatable
问题描述
product_id
和 owner_org_id
),我需要确定哪些行是重复的。一旦我确定了,我可以使用该信息来构建我的结果,这是一个仅包含不唯一的行的数据表,以及仅包含唯一的行的数据表。 我已经看过这里的其他例子,我到目前为止编写和执行的代码,但似乎认为数据中的每一行都是唯一的。实际上在测试数据中有13行,只有6是唯一的。很明显我在做错事。
编辑:我应该注意,有重复的行应该 ALL
EDIT2 :或者,如果我可以选择所有重复的行(而不是尝试选择唯一的行)对我来说很好。无论哪种方式都可以让我得到最终结果。
处理方法中的代码:
MyRowComparer myrc = new MyRowComparer();
var uniquerows = dtCSV.AsEnumerable()。Distinct(myrc);
以及以下内容:
public class MyRowComparer:IEqualityComparer< DataRow>
{
public bool Equals(DataRow x,DataRow y)
{
// return((string.Compare(x.Field< string>(PRODUCT_ID),y .Field< string>(PRODUCT_ID),true))==
//(string.Compare(x.Field< string>(OWNER_ORG_ID),y.Field< string>(OWNER_ORG_ID ,true)));
return
x.ItemArray.Except(new object [] {x [x.Table.Columns [PRODUCT_ID]。ColumnName]})==
y.ItemArray.Except(new object [] {y [y.Table.Columns [PRODUCT_ID]。ColumnName]})&&
x.ItemArray.Except(new object [] {x [x.Table.Columns [OWNER_ORG_ID]。ColumnName]})==
y.ItemArray.Except(new object [] {y [y.Table.Columns [OWNER_ORG_ID]。ColumnName]});
}
public int GetHashCode(DataRow obj)
{
int y = int.Parse(obj.Field< string>(PRODUCT_ID));
int z = int.Parse(obj.Field< string>(OWNER_ORG_ID));
int c = y ^ z;
return c;
}
}
你可以使用LINQ-To-DataSet和 Enumerable.Except
/ 相交
:
var tbl1ID = tbl1.AsEnumerable()
.Select(r => new
{
product_id = r.Field< String> ;(product_id),
owner_org_id = r.Field< String>(owner_org_id),
});
var tbl2ID = tbl2.AsEnumerable()
.Select(r => new
{
product_id = r.Field< String>(product_id),
owner_org_id = r.Field< String>(owner_org_id),
});
var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);
var tblUnique =(from uniqueRow in unique
join row in tbl1.AsEnumerable()
on uniqueRow equals new
{
product_id = row.Field< ; String>(product_id),
owner_org_id = row.Field< String>(owner_org_id)
}
选择行).CopyToDataTable();
var tblBoth =(from bothRow in both
join row in tbl1.AsEnumerable()
on BothRow equals new
{
product_id = row.Field< String> product_id),
owner_org_id = row.Field< String>(owner_org_id)
}
选择行).CopyToDataTable();
修改:显然我误解了你的要求。所以你只有一个 DataTable
,并希望获得所有唯一的和所有重复的行,这更直接。您可以使用 Enumerable.GroupBy
,其中包含两个字段的匿名类型:
var groups = tbl1.AsEnumerable()
.GroupBy(r => new
{
product_id = r.Field< String>(product_id),
owner_org_id = r.Field< String>(owner_org_id )
});
var tblUniques = groups
.Where(grp => grp.Count()== 1)
.Select(grp => grp.Single())
。 CopyToDataTable();
var tblDuplicates = groups
.Where(grp => grp.Count()> 1)
.SelectMany(grp => grp)
.CopyToDataTable();
I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id
and owner_org_id
) in the datatable. Once I've determined that, I can use that information to build my result, which is a datatable containing only the rows that are not unique, and a data table containing only the rows that are unique.
I've looked at other examples on here and the code I've come up with so far does compile and execute, but it seems to think every row in the data is unique. In reality in the test data there's 13 rows and only 6 are unique. So clearly I'm doing something wrong.
EDIT: Thought I should note, rows that have duplicates should ALL be removed, not just the duplicates of that row. eg if there are 4 duplicates, all 4 should be removed not 3, leaving one unique row from the 4.
EDIT2: Alternatively, if I can select all duplicate rows (instead of trying to select unique rows) it is fine with me. Either way can get me to my end result.
The code in the processing method:
MyRowComparer myrc = new MyRowComparer();
var uniquerows = dtCSV.AsEnumerable().Distinct(myrc);
along with the following:
public class MyRowComparer : IEqualityComparer<DataRow>
{
public bool Equals(DataRow x, DataRow y)
{
//return ((string.Compare(x.Field<string>("PRODUCT_ID"), y.Field<string>("PRODUCT_ID"), true)) ==
// (string.Compare(x.Field<string>("OWNER_ORG_ID"), y.Field<string>("OWNER_ORG_ID"), true)));
return
x.ItemArray.Except(new object[] { x[x.Table.Columns["PRODUCT_ID"].ColumnName] }) ==
y.ItemArray.Except(new object[] { y[y.Table.Columns["PRODUCT_ID"].ColumnName] }) &&
x.ItemArray.Except(new object[] { x[x.Table.Columns["OWNER_ORG_ID"].ColumnName] }) ==
y.ItemArray.Except(new object[] { y[y.Table.Columns["OWNER_ORG_ID"].ColumnName] });
}
public int GetHashCode(DataRow obj)
{
int y = int.Parse(obj.Field<string>("PRODUCT_ID"));
int z = int.Parse(obj.Field<string>("OWNER_ORG_ID"));
int c = y ^ z;
return c;
}
}
You could use LINQ-To-DataSet and Enumerable.Except
/Intersect
:
var tbl1ID = tbl1.AsEnumerable()
.Select(r => new
{
product_id = r.Field<String>("product_id"),
owner_org_id = r.Field<String>("owner_org_id"),
});
var tbl2ID = tbl2.AsEnumerable()
.Select(r => new
{
product_id = r.Field<String>("product_id"),
owner_org_id = r.Field<String>("owner_org_id"),
});
var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);
var tblUnique = (from uniqueRow in unique
join row in tbl1.AsEnumerable()
on uniqueRow equals new
{
product_id = row.Field<String>("product_id"),
owner_org_id = row.Field<String>("owner_org_id")
}
select row).CopyToDataTable();
var tblBoth = (from bothRow in both
join row in tbl1.AsEnumerable()
on bothRow equals new
{
product_id = row.Field<String>("product_id"),
owner_org_id = row.Field<String>("owner_org_id")
}
select row).CopyToDataTable();
Edit: Obviously i've misunderstood your requirement a little bit. So you only have one DataTable
and want to get all unique and all duplicate rows, that's even more straight-forward. You can use Enumerable.GroupBy
with an anonymous type containing both fields:
var groups = tbl1.AsEnumerable()
.GroupBy(r => new
{
product_id = r.Field<String>("product_id"),
owner_org_id = r.Field<String>("owner_org_id")
});
var tblUniques = groups
.Where(grp => grp.Count() == 1)
.Select(grp => grp.Single())
.CopyToDataTable();
var tblDuplicates = groups
.Where(grp => grp.Count() > 1)
.SelectMany(grp => grp)
.CopyToDataTable();
这篇关于确定数据表中的重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!