Read large csv file with many duplicate values, drop duplicates while reading


Problem Description


I have the following pandas code snippet that reads all the values found in a specific column of my .csv file.

import pandas as pd

# Read only the fifth column (index 4); squeeze=True returns a Series
sample_names_duplicates = pd.read_csv(infile, sep="\t",
                                      engine="c", usecols=[4],
                                      squeeze=True)


That particular column of my file contains perhaps 20 values at most (sample names), so it would probably be faster if I could drop the duplicates on the fly instead of storing them all and deleting the duplicates afterwards. Is it possible to drop duplicates as they are encountered, somehow?


If not, is there a way to do this more quickly, without having to make the user explicitly name what the sample names in her file are?

Recommended Answer


As the result returned by read_csv() is iterable, you can simply wrap it in a set() call to remove duplicates. Note that using a set will lose any ordering you may have. If you then want the values sorted, use list() and sort().

Example of a unique, unordered set:

sample_names_duplicates = set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True))

Example of a sorted list:

sample_names = list(set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True)))
sample_names.sort()
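
For a file too large to hold the whole column in memory at once, a chunked read can drop duplicates on the fly, which is closer to what the question asked for. This is a minimal sketch along those lines, not part of the original answer: the chunksize of 10000 is an arbitrary assumption, and infile is assumed to be defined as in the question.

import pandas as pd

# Read column 4 in chunks, keeping only the running set of unique values,
# so the full column is never held in memory at once.
unique_names = set()
for chunk in pd.read_csv(infile, sep="\t", engine="c",
                         usecols=[4], chunksize=10000):
    unique_names.update(chunk.iloc[:, 0].unique())

sample_names = sorted(unique_names)  # sorted list, as in the second example

Since only the set of roughly 20 distinct names is retained between chunks, peak memory stays small regardless of the file's length.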
