Read large dataset Pandas
Question
I'm trying to read a 20 GB dataset. I've searched for a solution and tried:
import pandas as pd

data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)
df = pd.concat(data, ignore_index=True)
but I still get a memory error when concatenating. (I've changed the chunksize many times; same result.)
I have 16 GB of RAM running at 3000 MHz.
Any suggestions?
I am trying to import the data into a dataframe for data analysis and manipulation, then export it back. (The data needs to be cleaned of NaNs and noisy values.)
Recommended answer
Not knowing exactly what you want/need to accomplish with the data makes this tricky - but most data manipulation can be done in SQL, so I would suggest using sqlite3 as the data-processing engine.
sqlite3 stores data on disk, which sidesteps the impossibility of reading 20 GB of data into 16 GB of RAM.
Also, read the documentation for pandas.DataFrame.to_sql.
You will need something like this (not tested):
import sqlite3

import pandas as pd

conn = sqlite3.connect('out_Data.db')
data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)
for data_chunk in data:
    data_chunk.to_sql('data', conn, if_exists='append')

c = conn.cursor()
c.execute("SELECT * FROM data GROUP BY variable1")
<<<perform data manipulation using SQL>>>
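Since the question mentions cleaning NaNs, here is a small sketch of what that SQL manipulation could look like. NaN values written by to_sql become SQL NULLs, so noisy rows can be deleted in SQL without ever loading the full table into memory. The column names below are assumptions, and an in-memory database stands in for the real 20 GB one:

```python
import sqlite3

import pandas as pd

# Small stand-in for the real dataset (column names are assumptions)
df = pd.DataFrame({"variable1": ["a", None, "b"], "value": [1.0, 2.0, None]})

conn = sqlite3.connect(":memory:")
df.to_sql("data", conn, index=False, if_exists="append")

# NaN values become SQL NULLs, so incomplete rows can be dropped in SQL
conn.execute("DELETE FROM data WHERE variable1 IS NULL OR value IS NULL")
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(remaining)  # only the one complete row survives
```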
Bear in mind that you can't bring the data into a pandas dataframe unless the operations you perform dramatically reduce its memory footprint.
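For example, an aggregate computed in SQL is usually small enough to pull back into pandas with read_sql_query. A sketch, again using an in-memory database and assumed column names:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"variable1": ["a", "a", "b"], "value": [1, 2, 3]}).to_sql(
    "data", conn, index=False
)

# The aggregate is far smaller than the raw table, so it fits in RAM
summary = pd.read_sql_query(
    "SELECT variable1, AVG(value) AS avg_value FROM data GROUP BY variable1",
    conn,
)
print(summary)
```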
To convert back to .csv, see Write to CSV from sqlite3 database in python.
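One way to do that from pandas itself is to stream the cleaned table back out in chunks, so the full result never sits in RAM at once. A sketch (file names and columns are assumptions):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"variable1": ["a", "b"], "value": [1, 2]}).to_sql(
    "data", conn, index=False
)

# Stream the table back out chunk by chunk; only write the header once
with open("out.csv", "w", newline="") as f:
    for i, chunk in enumerate(
        pd.read_sql_query("SELECT * FROM data", conn, chunksize=1000)
    ):
        chunk.to_csv(f, header=(i == 0), index=False)
```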
For better performance:
- Increase the chunksize to the largest your system can handle
- The sqlite3 CLI actually has a built-in way to import .csv files, which is much faster than going through Python.
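That CLI import looks roughly like the following; `sample.csv` and `sample.db` are stand-in names (the tiny CSV is created here just so the sketch is self-contained). When the target table does not yet exist, `.import` uses the CSV header row as column names:

```shell
# Stand-in CSV; in practice you would point .import at the real 20 GB file
printf 'id,value\n1,10\n2,20\n' > sample.csv
sqlite3 sample.db ".mode csv" ".import sample.csv data"
sqlite3 sample.db "SELECT COUNT(*) FROM data;"
```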