Marking duplicates in a CSV file

Problem description

I'm stumped with a problem illustrated in the sample below:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

I want to detect duplicates in the "PHONE" column and mark the subsequent duplicates using the "REF" column, with a value pointing to the "ID" of the first item, and set the "DISCARD" column to "Yes":

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

So, how do I go about it? I tried this code but my logic wasn't right, of course.

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your direction and help would be much appreciated.

Solution

The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.

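import csv

# phone number -> ID of the first row that used it
first_id_by_phone = {}
# Python 3: the csv module expects files opened with newline=''
with open(r'c:\temp\input.csv', 'r', newline='') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w', newline='') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (row_id, name, phone, ref, discard) = row
            if phone in first_id_by_phone:
                # seen before: point REF at the first occurrence and flag the row
                ref = first_id_by_phone[phone]
                discard = "Yes"
            else:
                first_id_by_phone[phone] = row_id
            writer.writerow((row_id, name, phone, ref, discard))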
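
Note that the single-pass loop above leaves "REF" empty on the first row of each duplicate group, whereas the expected output in the question fills it in for rows 1 and 2 while leaving the unique row 5 blank. If that exact output matters, one possible approach is two passes: count the phone numbers first, then mark. The sketch below is only an illustration under that assumption; `mark_duplicates` is a hypothetical helper name, and the sample data is held in memory with `io.StringIO` instead of real files.

```python
import csv
import io
from collections import Counter


def mark_duplicates(rows):
    """Return new rows with REF and DISCARD filled in.

    rows: list of [ID, NAME, PHONE, REF, DISCARD] lists, header excluded.
    """
    # Pass 1: count how often each phone number occurs.
    counts = Counter(row[2] for row in rows)
    first_id_by_phone = {}
    marked = []
    # Pass 2: set REF on every row whose phone is shared,
    # and DISCARD on every occurrence after the first.
    for row_id, name, phone, ref, discard in rows:
        if counts[phone] > 1:
            if phone in first_id_by_phone:
                discard = "Yes"
            else:
                first_id_by_phone[phone] = row_id
            ref = first_id_by_phone[phone]
        marked.append([row_id, name, phone, ref, discard])
    return marked


# Sample data from the question (quoting omitted for brevity).
sample = """ID,NAME,PHONE,REF,DISCARD
1,JOHN,12345,,
2,PETER,6232,,
3,JON,12345,,
4,PETERSON,6232,,
5,ALEX,7854,,
6,JON,12345,,
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)
result = mark_duplicates(list(reader))

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(result)
print(out.getvalue())
```

The trade-off is memory: this version holds every row in a list (or would need to read the file twice), while the single-pass version only keeps the phone-to-ID dictionary.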
