Find duplicate entries in file and perform correct action


Problem description

I've got a .csv file containing 9 columns that I need to sort out. I need to check for duplicates in one column and, if I find any, compare the duplicates and decide which row to delete depending on which one is the oldest. One of the columns contains datetime data.

I'm not really sure how to proceed, even with the most straightforward method of iterating through all rows and comparing them all to each other. Performance-wise this shouldn't be a problem for me.

Most of the problems come from dealing with the file I/O. But I'm considering reading the file and, for each line, storing the entire row in one collection, then just the time column in a second collection, and lastly the column I'll do the duplicate check on in a third. As long as they are indexed the same way it should be simple enough to do the time comparison, remove the correct row and save back to file.

I just feel as if this solution is highly inefficient and that there should be some better way of doing it.

One simple thing that I think I should be able to do is make sure that the file is sorted by the column where I'll do the duplicate check, because then all I've got to check is whether the next value is equal to the previous one before moving on. But this leaves me with having to sort the list anew at the end instead.

Any tips on a good way of dealing with this are appreciated. I'm quite new to VB6, and both file I/O and string manipulation are things I'm not very good at. My level is at appending some sort of tag and saving the line to file, then reading the tag back when reading the file and usually doing very simple operations.

Since this was a workaround to another problem, on Chill60's suggestion I include the original problem too.

The data I'm trying to format is contained in three different tables and is a total of 9 fields. What I need to do is to check for duplicates in one column and if I find any select the newest entry. This turned into a nightmare because my current SQL skills are at the same level as my VB6 skills, meaning barely 2 months of splotchy experience.

SELECT Stamps.Stampnr as Stampnr, Stamps.Time as 'Time', Stamps.amount as 'Amount', Products.Productname as 'Productname', Products.Articelnumber as 'Articlenumber', FlagContainer.id as 'FlagId', FlagContainer.FlagId as 'Flag', Process.prnr as 'CurrentPrNr', Process.numProcesses as 'ProcessNumbers' 
FROM     Stamps INNER JOIN 
process ON Stamps.prnr = Process.prnr INNER JOIN 
Products ON Process.productnr = Products.productnr INNER JOIN 
Flagcontainer ON Stamps.ID = Flagcontainer.id 
WHERE  (Stamps.Time > '" & dtmYesterday & "' + ' 06:00:00') 
and (Stamps.Time < '" & dtmNow & "' + ' 06:00:00') 
and Flagcontainer.flagid = 5 order by FlagId



It might look a bit weird with the tables called flagid but I translated from Swedish and tried to make it as readable as possible.

The column I'm looking for duplicates in is the one that's called Flagcontainer.id and then selecting the oldest using Stamps.Time.

Solution

If you absolutely must use VB6 (see my comment above - oh and I've just seen your response!) then this method should work for you ... my apologies but I can't test any of this as I no longer have VB6.

Read the entire CSV file into a RecordSet. This link[^] should help you with that. Incidentally that site (which is nothing to do with me) is quite handy for finding code snippets in VB6 (at least until it disappears). I'm reproducing the code here in case the link breaks in the future...

Dim connCSV As New ADODB.Connection
Dim rsTest As New ADODB.Recordset
Dim adcomm As New ADODB.Command
Dim path As String

path = "C:\Testdir\"  'Here Test dir is the Directory where
' the text file is located. Don't write the file name here.

'This is connection for a text file without Header

'connCSV.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" _
' & path & ";Extended Properties='text;HDR=NO;FMT=Delimited'"

'This is connection for a text file with Header (i.e., columns

connCSV.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" _
    & path & ";Extensions=asc,csv,tab,txt;HDR=NO;Persist Security Info=False"

rsTest.Open "Select * From test.txt", _
    connCSV, adOpenStatic, adLockReadOnly, adCmdText
Do While Not rsTest.EOF
    MsgBox rsTest(0)   'You can select the required data
    rsTest.MoveNext
Loop

'IF YOU WANT TO TEST THIS,
'SAVE THE FOLLOWING TO C:\TESTDIR\TEST.TXT
'AND RUN THE ABOVE CODE WITH THE TWO DIFFERENT
'CONNECTION OPEN STATEMENTS

'Name,Address,City,State,Zip
'John , Doe, NY, NY, 910

Note it also shows how you can iterate through the data.

A recordset can be sorted - but there may be issues depending on the content of the data - if you have problems then have a look at the suggestions on this link[^]. You can either loop through the dataset applying your selection criteria, try to do something clever with Distinct or try the dictionary approach advocated here[^] (This is where my lack of being able to test anything is making this a bit vague - sorry).
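As a rough illustration of the dictionary approach mentioned above, something along these lines might work. It is an untested sketch: KEY_COL and TIME_COL are made-up column positions that you would need to change to match your file, and OldestRowPerKey is a hypothetical helper name. It also assumes the time column is never Null and converts cleanly with CDate under your regional settings. It uses late binding for Scripting.Dictionary, so the only project reference needed is ADO.

```vb
' Sketch only: keeps the OLDEST row for each value of the key column.
' KEY_COL and TIME_COL are assumed 0-based column positions; adjust them.
Private Const KEY_COL As Long = 5
Private Const TIME_COL As Long = 1

Public Function OldestRowPerKey(rs As ADODB.Recordset) As Object
    Dim dict As Object        ' Scripting.Dictionary: key value -> row (Variant array)
    Dim keyValue As String
    Dim row() As Variant
    Dim kept As Variant
    Dim i As Long

    Set dict = CreateObject("Scripting.Dictionary")

    Do While Not rs.EOF
        ' Copy the current record into a plain Variant array
        ReDim row(0 To rs.Fields.Count - 1)
        For i = 0 To rs.Fields.Count - 1
            row(i) = rs.Fields(i).Value
        Next i

        keyValue = row(KEY_COL) & ""     ' & "" turns a Null key into an empty string
        If Not dict.Exists(keyValue) Then
            dict.Add keyValue, row
        Else
            kept = dict.Item(keyValue)
            ' Replace the stored row only when the current one is older
            If CDate(row(TIME_COL)) < CDate(kept(TIME_COL)) Then
                dict.Item(keyValue) = row
            End If
        End If

        rs.MoveNext
    Loop

    Set OldestRowPerKey = dict
End Function
```

You could call it as `Set kept = OldestRowPerKey(rsTest)` right after opening the recordset. Flip the `<` to `>` if it turns out you want to keep the newest row instead.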

I would suggest having a 2nd dataset (same schema) to copy the records you want into, which can then be saved or whatever when you are complete, e.g. save to CSV[^]
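If a second recordset feels like overkill, a plainer alternative is to write the kept rows straight out with the built-in file statements. The sketch below is untested and swaps in that simpler technique rather than the recordset save suggested above. It assumes the kept rows sit in a Scripting.Dictionary whose items are Variant arrays holding one row each. SaveRowsToCsv is a hypothetical name. Fields that contain commas or quotes are not escaped, and dates are written using your regional settings.

```vb
' Sketch only: writes each kept row as one comma-separated line.
' Assumes every item in dict is a Variant array of field values.
Public Sub SaveRowsToCsv(dict As Object, ByVal fileName As String)
    Dim fileNo As Integer
    Dim item As Variant
    Dim fields() As String
    Dim i As Long

    fileNo = FreeFile
    Open fileName For Output As #fileNo

    For Each item In dict.Items
        ReDim fields(LBound(item) To UBound(item))
        For i = LBound(item) To UBound(item)
            fields(i) = item(i) & ""     ' & "" turns Null into an empty string
        Next i
        Print #fileNo, Join(fields, ",")
    Next item

    Close #fileNo
End Sub
```

Writing to a new file name rather than overwriting the original is probably wise until you have checked the output.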

Having said all that, if you have access to a database (e.g. MS Access) then it might be worth saving the data into that and using database functions to manipulate the information.

Have a crack at it then come back if you hit any issues

[EDIT - alternative - a suggested method for de-duplicating on the database side]
If you put your current query into a CTE (Common Table Expression) you can do the de-duplication as a subsequent query against that CTE (see Common Table Expressions(CTE) in SQL SERVER 2008[^] for a more detailed explanation)

For example:

-- Assume these are passed in
DECLARE @dtmYesterday DATETIME
DECLARE @dtmNow DATETIME
DECLARE @flagid int

-- test data
SET @dtmYesterday = dateadd(dd, datediff(dd, 0, getdate()) - 1, 0)
SET @dtmNow = dateadd(dd, datediff(dd, 0, getdate()), 0)

-- Dates are passed in as midnight (based on what I saw)
-- so set to 06:00 hours
SET @dtmYesterday = dateadd(hh, 6, @dtmYesterday)
SET @dtmNow = dateadd(hh, 6, @dtmNow)
SET @flagid = 5

;WITH CTE AS
( 
	SELECT 
		Stamps.Stampnr as Stampnr, Stamps.Time as 'Time', Stamps.amount as 'Amount', 
		Products.Productname as 'Productname', Products.Articelnumber as 'Articlenumber', 
		FlagContainer.id as 'FlagId', FlagContainer.FlagId as 'Flag', 
		Process.prnr as 'CurrentPrNr', Process.numProcesses as 'ProcessNumbers' 
	FROM     
		Stamps 
		INNER JOIN process ON Stamps.prnr = Process.prnr 
		INNER JOIN Products ON Process.productnr = Products.productnr 
		INNER JOIN Flagcontainer ON Stamps.ID = Flagcontainer.id 
	WHERE  
		Stamps.Time > @dtmYesterday
		and Stamps.Time < @dtmNow 
		and Flagcontainer.flagid = @flagid 
)
select CTE.* from CTE
inner join (SELECT FlagId, Productname, MIN([Time]) as mintime FROM CTE GROUP BY FlagId, ProductName) A
	ON CTE.FlagId=A.FlagId AND CTE.Productname = A.Productname
	AND CTE.[Time] = A.mintime
ORDER BY FlagId, Productname

This might need tweaking for your purposes as I used both FlagContainer.id and Products.Productname to drive the query. I assumed that if there was more than one id + Productname pairing then all the rest of the data had to come from the oldest entry - just remove Productname from the second query if you don't need it, or add in other columns if you need them.

Also note the way I've used local (sql) variables - ideally you would put this into a Stored Procedure that accepts those arguments.
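On the VB6 side, calling such a stored procedure with real parameters (instead of concatenating the dates into the SQL string) might look roughly like this. Again untested, and it rests on some assumptions: usp_GetOldestStamps is a hypothetical procedure name that wraps the query above, GetOldestStamps is a hypothetical helper, and conn is an ADODB.Connection you have already opened against the database.

```vb
' Sketch only: runs a hypothetical stored procedure with typed parameters.
Public Function GetOldestStamps(conn As ADODB.Connection, _
                                ByVal dtmYesterday As Date, _
                                ByVal dtmNow As Date, _
                                ByVal flagId As Long) As ADODB.Recordset
    Dim cmd As ADODB.Command

    Set cmd = New ADODB.Command
    Set cmd.ActiveConnection = conn
    cmd.CommandType = adCmdStoredProc
    cmd.CommandText = "usp_GetOldestStamps"   ' hypothetical procedure name

    ' Append the parameters in the same order the procedure declares them
    cmd.Parameters.Append cmd.CreateParameter("@dtmYesterday", adDBTimeStamp, adParamInput, , dtmYesterday)
    cmd.Parameters.Append cmd.CreateParameter("@dtmNow", adDBTimeStamp, adParamInput, , dtmNow)
    cmd.Parameters.Append cmd.CreateParameter("@flagid", adInteger, adParamInput, , flagId)

    Set GetOldestStamps = cmd.Execute
End Function
```

Passing the dates as typed parameters also sidesteps the date-format problems you can get when building the WHERE clause from strings.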

