Reading huge csv files efficiently?
Problem description
I know how to use pandas to read files with a CSV extension. When reading a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns, and consists mostly of genome data from large populations.
How can I overcome this problem, what is the standard practice, and how do I select the appropriate tool for this? Can I process a file this big with pandas, or is there another tool?
Recommended answer
You can use Apache Spark to distribute in-memory processing of CSV files: https://github.com/databricks/spark-csv. Also take a look at ADAM's approach to distributed genomic data processing.