Python removing duplicates


Question

I'm trying to get my code to remove duplicates in place from a .csv file, which can be found at http://www.sharecsv.com/s/29ae855f20472de54b12fa66bbe3cbb9/DBA.csv

I got a suggestion on what to do and ended up with code looking like this:

import csv
from fileinput import FileInput
from itertools import tee

def deleteDuplicate():
    seen = set()
    dupeCount = 0
    counter = 0
    # inplace=1 redirects stdout into DBA.csv, so print() rewrites the file
    with FileInput('DBA.csv', inplace=1) as f:
        # duplicate the line iterator: one copy feeds the csv parser,
        # the other keeps the raw line so it can be printed back verbatim
        f, f_orig = tee(f)
        for row, line in zip(csv.reader(f), f_orig):
            if row[2] in seen:  # dedupe on the URL column
                dupeCount += 1
                continue
            seen.add(row[2])
            counter += 1
            print(line, end='')  # keep this line in the file
    # print the summary after the with block, otherwise it would be
    # written into DBA.csv along with the kept lines
    print(counter)
    print("Removed {} Duplicates".format(dupeCount))

The above code works perfectly for removing duplicates on a smaller test scale like:

null,first,second,third
zero,one,two,three
null,first,second,third
nul,un,deux,trois
0,"1,one",2,3

When I run it on my larger .csv file it removes the duplicates perfectly fine, but it ends up removing an additional 4 rows. Those 4 rows aren't tracked in my dupeCount, so they shouldn't have triggered my if statement.

I must admit that I'm not quite sure what tee() from itertools is used for here, or why it's beneficial.

My 2 questions are: Why does deleteDuplicate() remove 4 extra rows from the larger .csv file, and why are tee() and zip() used?

Answer

Look at the first rows of the data. The description has newlines ('\n') in it, as well as commas, so we have 7 "lines" of data:

Date,Price DKK,URL,Description
19/5,1 kr.,http://www.dba.dk/8660-vegavej-1-14/id-102010171/,"8660, Vegavej 1-14, helårsgrund, Boligprojekt sælges 1-14 boliger
Rækkehusene ligger ud til et stort smukt fredet område. Alle boliger har private sydvendte haver, som ligger direkte ud til et fælles område. Der er altan, hvorfra der er udsigt over det facinerende og karakteristiske landskab med åløb, heste, gravhøj.
Aktiv fritid og lokalmiljø.
Tebstrup er en lille landsby med 660 indbyggere. I byen er der skole, børnehave m.m
se"
19/5,1.599.000 kr.,http://www.dba.dk/7800-4-103-372-2013/id-93506363/,"7800 4, 103, 372, 2013, Fyrtøjet 8, 7656, 6130, 80000, Villa"

But if you read it with csv (or Excel), the newlines are encapsulated by the quotes, so it's only one cell on that row:

with open("output.csv") as f : 
    for row in csv.reader(f):
        print( row )  

['Date', 'Price DKK', 'URL', 'Description']
['19/5', '1 kr.', 'http://www.dba.dk/8660-vegavej-1-14/id-102010171/', '8660, Vegavej 1-14, helårsgrund, Boligprojekt sælges 1-14 boliger\r\nRækkehusene ligger ud til et stort smukt fredet område. Alle boliger har private sydvendte haver, som ligger direkte ud til et fælles område. Der er altan, hvorfra der er udsigt over det facinerende og karakteristiske landskab med åløb, heste, gravhøj.\r\nAktiv fritid og lokalmiljø.\r\nTebstrup er en lille landsby med 660 indbyggere. I byen er der skole, børnehave m.m\r\nse']
['19/5', '1.599.000 kr.', 'http://www.dba.dk/7800-4-103-372-2013/id-93506363/', '7800 4, 103, 372, 2013, Fyrtøjet 8, 7656, 6130, 80000, Villa']

Lines of the file may not equal rows in the CSV data.
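
This is also where the tee() and zip() pairing in the question's code breaks down: tee() clones the line iterator so csv.reader can parse one copy while the raw lines come from the other, and zip() pairs each parsed row with exactly one raw line. That pairing only holds while every record occupies one line. A minimal sketch with hypothetical two-record data (data, f, and f_orig are illustrative names):

import csv
from io import StringIO
from itertools import tee

# hypothetical data: the first record's last field spans three lines
data = 'a,b,"c\nstill c\nend c"\nd,e,f\n'

f, f_orig = tee(StringIO(data))
for row, line in zip(csv.reader(f), f_orig):
    print(row, repr(line))

# ['a', 'b', 'c\nstill c\nend c'] 'a,b,"c\n'
# ['d', 'e', 'f'] 'still c\n'

The reader consumed three lines for the first record, but zip() advanced the raw copy only one line per row, so the second parsed row is paired with a continuation line of the first record, and the last raw lines are never printed back. That is how rows can go missing without ever hitting the duplicate branch.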

Edit

Additions to your test file to confirm what you may be seeing:

null,first,second,third
zero,one,two,"three
,four
five\r\n"
null,first,second,third
nul,un,deux,trois
0,"1,one",2,3
