Remove duplicate rows from a large file in Python

Problem description

I have a csv file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

Each row contains 15 fields and several hundred characters, and all fields are needed to determine uniqueness. Instead of comparing the entire row to find a duplicate, I'm comparing hash(row-as-a-string) in an attempt to save memory. I set a filter that partitions the data into a roughly equal number of rows (e.g. days of the week), and each partition is small enough that a lookup table of hash values for that partition will fit in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (pseudo code):

import csv

# field names; the real file has 15 fields, abbreviated here
headers = ['DayOfWeek', 'a', 'b']
outs = csv.DictWriter(open('c:/dedupedFile.csv', 'wb'), headers)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# write the header row (Python 2.5's DictWriter has no writeheader())
outs.writerow(dict(zip(headers, headers)))

for day in days:                   # one pass over the input file per partition
    htable = {}                    # hashes of rows already seen for this day
    ins = csv.DictReader(open('c:/bigfile.csv', 'rb'), headers)
    for line in ins:
        # hash the concatenated field values instead of storing the whole row
        hvalue = hash(reduce(lambda x, y: x + y, line.itervalues()))
        if line['DayOfWeek'] == day:
            if hvalue in htable:
                pass               # duplicate within this partition; skip it
            else:
                htable[hvalue] = None
                outs.writerow(line)

One way I was thinking of speeding this up is to find a better filter to reduce the number of passes necessary. Assuming the length of the rows is uniformly distributed, maybe instead of

for day in days: 

and

if line['DayOfWeek']==day:

we have

for i in range(n):

and

if len(reduce(lambda x, y: x + y, line.itervalues())) % n == i:

where 'n' is as small as memory will allow. But this is still using the same method.
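
Put together, that length-based filter would look roughly like the following sketch. It reuses the abbreviated field list and file paths from the snippet above, and the partition count n is a hypothetical value to be tuned to the available memory:

import csv

headers = ['DayOfWeek', 'a', 'b']      # abbreviated field list, as above
n = 7                                  # hypothetical partition count; tune to memory
outs = csv.DictWriter(open('c:/dedupedFile.csv', 'wb'), headers)
outs.writerow(dict(zip(headers, headers)))

for i in range(n):                     # one full pass over the input per partition
    htable = {}
    ins = csv.DictReader(open('c:/bigfile.csv', 'rb'), headers)
    for line in ins:
        row_as_string = reduce(lambda x, y: x + y, line.itervalues())
        if len(row_as_string) % n == i:    # partition by row length instead of by day
            hvalue = hash(row_as_string)
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)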

Wayne Werner provided a good practical solution below; I was curious if there was a better/faster/simpler way to do this from an algorithmic perspective.

P.S. I'm limited to Python 2.5.

Solution

If you want a really simple way to do this, just create a sqlite database:

import sqlite3
conn = sqlite3.connect('single.db')
cur = conn.cursor()
cur.execute("""create table test(
f1 text,
f2 text,
f3 text,
f4 text,
f5 text,
f6 text,
f7 text,
f8 text,
f9 text,
f10 text,
f11 text,
f12 text,
f13 text,
f14 text,
f15 text,
primary key(f1,  f2,  f3,  f4,  f5,  f6,  f7,  
            f8,  f9,  f10,  f11,  f12,  f13,  f14,  f15))
"""
conn.commit()

#simplified/pseudo code
import csv
#input/output paths follow the ones used in the question
reader = csv.reader(open('c:/bigfile.csv', 'rb'))
writer = csv.writer(open('c:/dedupedFile.csv', 'wb'))

for row in reader:
    #assuming row returns a list-type object
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
        conn.commit()
    except sqlite3.IntegrityError:
        #the 15-field primary key silently rejects duplicate rows
        pass

conn.commit()
cur.execute('select * from test')

for row in cur:
    #write each unique row back out to the csv file
    writer.writerow(row)

Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You could modify the types stored in the database if you wanted, or not, as the case may be. And since you're already converting the data to a string, you could just have one field instead. Plenty of options here.
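
A minimal sketch of that single-field variant, assuming each row has already been joined into one string; the table and column names (test_single, row_text) and the add_row helper are hypothetical:

import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
# one text column holds the whole row; the primary key enforces uniqueness
cur.execute("""create table test_single(row_text text primary key)""")
conn.commit()

def add_row(row_as_string):
    # insert a row string; silently skip it if it is already present
    try:
        cur.execute('insert into test_single values(?)', (row_as_string,))
        conn.commit()
    except sqlite3.IntegrityError:
        pass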
