How to drop duplicated rows using pandas in a big data file?


Problem description



I have a csv file that is too big to load into memory. I need to drop the duplicated rows of the file, so I tried this:

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],      encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])

But if duplicated rows are distributed across different chunks, it seems the script above can't produce the expected result.

Is there any better way?

Solution

You could try something like this.

First, create your chunker.

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

Now create a set of ids:

ids = set()

Now iterate over the chunks:

for chunk in chunker:
    chunk = chunk.drop_duplicates(['Author ID'])

Now, still within the body of the loop, also drop the rows whose ids are already in the set of known ids:

    chunk = chunk[~chunk['Author ID'].isin(ids)]

Finally, still within the body of the loop, add the new ids to the set:

    ids.update(chunk['Author ID'].values)
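
Putting those pieces together, a minimal end-to-end sketch could look like the following. It writes the surviving rows out as it goes; the output path, the tab separator and the choice to write no header row are assumptions of this sketch, not part of the original answer:

    import pandas as pd

    # Placeholder paths for illustration; in the question AUTHORS_PATH is already defined.
    AUTHORS_PATH = 'authors.tsv'
    OUTPUT_PATH = 'authors_deduplicated.tsv'

    chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                            encoding='utf-8', chunksize=10000000)

    ids = set()          # Author IDs seen in earlier chunks
    first_chunk = True   # create the output file on the first write, append afterwards

    for chunk in chunker:
        # Remove duplicates inside the current chunk.
        chunk = chunk.drop_duplicates(['Author ID'])
        # Remove rows whose Author ID already appeared in an earlier chunk.
        chunk = chunk[~chunk['Author ID'].isin(ids)]
        # Remember the surviving IDs for the following chunks.
        ids.update(chunk['Author ID'].values)
        # Append the unique rows to the output file (tab-separated, no header).
        chunk.to_csv(OUTPUT_PATH, sep='\t', index=False, header=False,
                     mode='w' if first_chunk else 'a')
        first_chunk = False

Note that chunk is re-assigned after each step because drop_duplicates and boolean indexing both return new DataFrames rather than modifying the chunk in place.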


If ids is too large to fit into main memory, you might need to use some disk-based database.
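
As a rough illustration of that last point, the in-memory set can be replaced by a small SQLite table keyed on the Author ID. Everything below (file names, chunk size, row-by-row lookups) is only a sketch under that assumption:

    import sqlite3
    import pandas as pd

    # Placeholder file names for illustration.
    AUTHORS_PATH = 'authors.tsv'
    OUTPUT_PATH = 'authors_deduplicated.tsv'

    # The on-disk table replaces the in-memory set of seen Author IDs.
    conn = sqlite3.connect('seen_author_ids.db')
    conn.execute('CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)')

    chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                            encoding='utf-8', chunksize=1000000)

    first_chunk = True
    for chunk in chunker:
        chunk = chunk.drop_duplicates(['Author ID'])
        new_rows = []
        for _, row in chunk.iterrows():
            key = str(row['Author ID'])
            # Keep the row only if its ID is not already in the on-disk table.
            seen = conn.execute('SELECT 1 FROM seen WHERE id = ?', (key,)).fetchone()
            if seen is None:
                conn.execute('INSERT INTO seen (id) VALUES (?)', (key,))
                new_rows.append(row)
        if new_rows:
            pd.DataFrame(new_rows).to_csv(OUTPUT_PATH, sep='\t', index=False,
                                          header=False,
                                          mode='w' if first_chunk else 'a')
            first_chunk = False
        conn.commit()

    conn.close()

Row-by-row lookups are slow, so in practice you would batch the existence checks per chunk; the point is only that the set of seen ids lives on disk instead of in RAM.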
