How to drop duplicated rows using pandas in a big data file?
Question
I have a csv file that is too big to load into memory. I need to drop duplicated rows from the file, so I do this:
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
for chunk in chunker:
chunk.drop_duplicates(['Author ID'])
But if duplicated rows are distributed across different chunks, the script above can't produce the expected result.
Is there any better way?
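To make the problem concrete, here is a minimal sketch (with a toy in-memory CSV standing in for the real file, and chunksize=2 standing in for 10000000) showing that per-chunk deduplication misses duplicates that land in different chunks:

```python
import pandas as pd
from io import StringIO

# Toy input: the two rows with Author ID 1 fall into different chunks.
csv = StringIO("1,Alice\n2,Bob\n1,Alice\n3,Carol\n")
chunks = pd.read_csv(csv, names=['Author ID', 'Author name'], chunksize=2)

# Deduplicate each chunk independently, as in the question.
out = pd.concat(chunk.drop_duplicates(['Author ID']) for chunk in chunks)
print(len(out))  # 4 -- the cross-chunk duplicate of Author ID 1 survives
```

Neither chunk contains a duplicate on its own, so `drop_duplicates` removes nothing and the duplicate pair survives.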
Solution: You could try something like this.
First, create your chunker.
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
Now create a set of ids:
ids = set()
Now iterate over the chunks:
for chunk in chunker:
chunk = chunk.drop_duplicates(['Author ID'])
Now, within the body of the loop, also drop ids already in the set of known ids:
chunk = chunk[~chunk['Author ID'].isin(ids)]
Finally, still within the body of the loop, add the new ids:
ids.update(chunk['Author ID'].values)
If ids is too large to fit into main memory, you might need to use some disk-based database.
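Putting the steps above together, here is a runnable end-to-end sketch. The `StringIO` input stands in for AUTHORS_PATH and chunksize=2 stands in for 10000000; in a real run you would append each processed chunk to an output file rather than collect them in a list:

```python
import pandas as pd
from io import StringIO

# Stand-in for the real file; duplicates of IDs 1 and 2 span chunk boundaries.
data = StringIO("1\tAlice\n2\tBob\n1\tAlice\n3\tCarol\n2\tBob\n")
chunker = pd.read_table(data, names=['Author ID', 'Author name'], chunksize=2)

ids = set()      # Author IDs seen in earlier chunks
pieces = []
for chunk in chunker:
    chunk = chunk.drop_duplicates(['Author ID'])   # duplicates within the chunk
    chunk = chunk[~chunk['Author ID'].isin(ids)]   # duplicates from earlier chunks
    ids.update(chunk['Author ID'].values)          # remember the new ids
    pieces.append(chunk)  # in a real run: chunk.to_csv(..., mode='a')

result = pd.concat(pieces, ignore_index=True)
print(result)  # exactly one row per Author ID
```

Note that `drop_duplicates` returns a new DataFrame, so the result must be assigned back to `chunk` for the filtering to work.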