Sql server查询返回6个lac记录,按列删除重复记录需要很长时间 [英] Sql server query returning 6 lac records, removing duplicate records by column taking long time

查看:88
本文介绍了Sql server查询返回6个lac记录,按列删除重复记录需要很长时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的SQL查询返回大约6个lac记录。从这些记录中,我必须删除其中一列具有重复值的行: AccountID 。我正在使用for循环遍历C#代码中的数据集,并通过检查列 AccountID 的值(如果已存在)来删除记录。这个过程需要很长时间。请建议根据列的值删除重复记录的其他方法: AccountID



我尝试过:



 public DataTable RemoveDuplicateRows(DataTable dTable,string colName)
{
Hashtable hTable = new Hashtable( );
ArrayList duplicateList = new ArrayList();

//将所有唯一项值的列表添加到哈希表,哈希表存储键,值对的组合。
//并在arraylist中添加重复的项值。
foreach(dTable.Rows中的DataRow drow)
{
if(hTable.Contains(drow [colName]))
duplicateList.Add(drow);
else
hTable.Add(drow [colName],string.Empty);
}

//从数据表中删除重复项的列表。
foreach(DataRow dRow in duplicateList)
dTable.Rows.Remove(dRow);

//包含唯一记录的数据表将作为输出返回。
返回dTable;
}

解决方案

Quote:

我的SQL查询返回大约6个lac记录。

你应该有充分的理由获得那么多记录;建立报告服务器?

Quote:

我必须删除其中一列有重复值的行:AccountID。

正如Maciej已经提到的,在数据​​库中执行此操作。原因很明显,

Quote:

我使用for循环和删除迭代C#代码中的数据集通过检查AccountID列的值(如果已存在)来记录

请自行查看,请求服务器为您提供6lac记录(每条记录需要时间,然后在网络上也需要时间,您可能正在使用它在本地,但您的用户将在远程连接上使用它)。完成后,您现在强制CPU找到重复项并删除它们。最糟糕的方法是编写SQL查询,然后查找并删除重复项。更好的方法是使用 SELECT DISTINCT ,或者使用 GROUP BY 将数据分组在一起。

Quote:

这个过程需要很长时间。请建议根据列的值删除重复记录的其他方法:AccountID。

是否要永久删除记录?我对此表示怀疑,因此您希望将数据汇总并且不重复。在这种情况下,请使用我上面提到的方法。



如果你想删除数据,并且不想允许重复数据(这不太可能,但仍然如此),那么你需要使用列作为表的主键。



您在上面的答案中看到了这种模式吗?我要求你让数据库引擎为你做所有这些事情。好处是数据库引擎会自动跳过它不需要返回的记录,而你的CPU只会渲染结果 - 不用担心过滤。



SELECT示例(Transact-SQL)| Microsoft Docs [ ^ ](参见 DISTINCT 示例)

GROUP BY(Transact-SQL)| Microsoft Docs [ ^ ]



快速分析您从此处复制的算法,< a href =https://stackoverflow.com/questions/4415519/best-way-to-remove-duplicate-entries-from-a-data-table> c# - 从数据表中删除重复条目的最佳方法 - 堆栈溢出 [ ^ ]你明显错过了比这个愚蠢的答案更好的答案,它发布在同一个帖子上, c# - 从数据表中删除重复条目的最佳方法 - Stack Overflow [ ^ ]

 public DataTable RemoveDuplicateRows(DataTable dTab le,string colName)
{
Hashtable hTable = new Hashtable();
ArrayList duplicateList = new ArrayList();

//将所有唯一项值的列表添加到哈希表,哈希表存储键,值对的组合。
//并在arraylist中添加重复的项值。
foreach(dTable.Rows中的DataRow drow)
{
if(hTable.Contains(drow [colName]))
duplicateList.Add(drow);
else
hTable.Add(drow [colName],string.Empty);
}

//从数据表中删除重复项的列表。
foreach(DataRow dRow in duplicateList)
dTable.Rows.Remove(dRow);

//包含唯一记录的数据表将作为输出返回。
返回dTable;
}

我会告诉你这不是一个好方法,因为你可以访问数据库,写出更好的查询。这种方法意味着您的6个lac记录将再次存储在内存中,然后对它们进行迭代以验证它们是否重复。



这也意味着,如果你的数据库表没有重复数据,那么这个算法仍然可以运行50万次,可能会存储每个实体。哈希表,更糟糕的。



我可以继续,但你明白了。 : - )


My SQL query is returning around 6 lac records. From these records I have to remove rows which Is having repeated value for one of the columns : AccountID. I am iterating through the dataset in C# code using for loop and deleting records by checking value of the column AccountID if it already exists. This process is taking very long time. Pls suggest alternate ways to remove the duplicate records based on the value of column: AccountID.

What I have tried:

public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
   Hashtable hTable = new Hashtable();
   ArrayList duplicateList = new ArrayList();

   //Add list of all the unique item value to hashtable, which stores combination of key, value pair.
   //And add duplicate item value in arraylist.
   foreach (DataRow drow in dTable.Rows)
   {
      if (hTable.Contains(drow[colName]))
         duplicateList.Add(drow);
      else
         hTable.Add(drow[colName], string.Empty); 
   }

   //Removing a list of duplicate items from datatable.
   foreach (DataRow dRow in duplicateList)
      dTable.Rows.Remove(dRow);

   //Datatable which contains unique records will be return as output.
      return dTable;
}

解决方案

Quote:

My SQL query is returning around 6 lac records.

You should have a really good reason to get that much of records; building a reporting server?

Quote:

I have to remove rows which Is having repeated value for one of the columns : AccountID.

As Maciej already mentioned, do this operation in the database. The reason is quite obvious,

Quote:

I am iterating through the dataset in C# code using for loop and deleting records by checking value of the column AccountID if it already exists

See for yourself, you ask the server to give you 6lac records (each record takes time, and then on the network it takes time as well, you might be using it locally, but your users are going to have it on a remote connection). Once you are done with that you are now forcing your CPU to find the duplicates and remove them. Worst way to write a SQL query and then find and remove duplicates. A better way to do this would be either using SELECT DISTINCT, or by grouping the data together using GROUP BY.

Quote:

This process is taking very long time. Pls suggest alternate ways to remove the duplicate records based on the value of column: AccountID.

Do you want to permanently remove the records? I doubt that, so you want to get the data aggregated, and non-duplicated. In this case, please use the approach I mentioned above.

If you want to remove the data, and do not want to allow the duplicate data (which is unlikely, but still), then you need to use the column as a primary key for the table.

You see the pattern in the my answer above? I am asking you to make the database engine do all this stuff for you. The benefit is that database engine will automatically skip the records it does not need to return, and your CPU would only be rendering the results—not worrying about filtering.

SELECT Examples (Transact-SQL) | Microsoft Docs[^] (See the DISTINCT examples)
GROUP BY (Transact-SQL) | Microsoft Docs[^]

A quick analysis of your algorithm, that you copied from here, c# - Best way to remove duplicate entries from a data table - Stack Overflow[^] and you clearly missed a better answer than this stupid one, that was posted on the same thread, c# - Best way to remove duplicate entries from a data table - Stack Overflow[^]

public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
   Hashtable hTable = new Hashtable();
   ArrayList duplicateList = new ArrayList();

   //Add list of all the unique item value to hashtable, which stores combination of key, value pair.
   //And add duplicate item value in arraylist.
   foreach (DataRow drow in dTable.Rows)
   {
      if (hTable.Contains(drow[colName]))
         duplicateList.Add(drow);
      else
         hTable.Add(drow[colName], string.Empty); 
   }

   //Removing a list of duplicate items from datatable.
   foreach (DataRow dRow in duplicateList)
      dTable.Rows.Remove(dRow);

   //Datatable which contains unique records will be return as output.
      return dTable;
}

I would tell you that this is not a good approach, since you have the access to the database, write better query. This approach means your 6 lac records be stored again in the memory, and then an iteration be done on them to verify if they are duplicate or not.

This also means, that if your database table has no duplicates, this algorithm is still going to run on it, half a million times, possibly storing every entity in the hash table, and worse.

I can go on, but you get the idea. :-)


这篇关于Sql server查询返回6个lac记录,按列删除重复记录需要很长时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆