Read large dataset Pandas


Question

I'm trying to read a 20 GB dataset. I've searched for a solution and I've tried:

   import pandas as pd

   data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)
   df = pd.concat(data, ignore_index=True)

but I still get a memory error when concatenating. (I have changed the chunksize many times, but the result is the same.)

I have 16 GB of RAM running at 3000 MHz.

Any suggestions?

I am trying to import the data into a DataFrame for data analysis and manipulation, then export it back. (The data needs to be cleaned of NaNs and noisy values.)

Answer

Not knowing exactly what you want/need to accomplish with the data does make this tricky, but most data manipulation can be done with SQL, so I would suggest using sqlite3 as the data-processing engine.

sqlite3 stores the data on disk, which sidesteps the impossibility of reading 20 GB of data into 16 GB of RAM.

Also, read the documentation for pandas.DataFrame.to_sql.

You will need something like this (not tested):

import sqlite3
import pandas as pd

conn = sqlite3.connect('out_Data.db')

# Stream the CSV in chunks and append each chunk to an on-disk table.
data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)

for data_chunk in data:
    data_chunk.to_sql('data', conn, if_exists='append', index=False)

c = conn.cursor()
c.execute("SELECT * FROM data GROUP BY variable1")
# <<<perform data manipulation using SQL>>>

Bear in mind that you can't bring the data into a pandas DataFrame unless the operations you perform dramatically reduce its memory footprint.
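
For example, once an aggregation has shrunk the result to something that fits comfortably in RAM, it can be pulled back into pandas with pandas.read_sql_query. A minimal sketch, assuming the table is named data as above; the averaged column variable2 is just a hypothetical illustration:

import sqlite3
import pandas as pd

conn = sqlite3.connect('out_Data.db')

# The GROUP BY runs inside sqlite3 on disk; only the (much smaller)
# aggregated result is materialised in memory as a DataFrame.
summary = pd.read_sql_query(
    "SELECT variable1, AVG(variable2) AS mean_v2 "
    "FROM data GROUP BY variable1",
    conn,
)
print(summary.head())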

To convert back to .csv, follow Write to CSV from sqlite3 database in python.
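
As an alternative to the linked approach, the same chunked pattern works in reverse; a sketch, assuming the cleaned table is still named data and writing to a hypothetical dataset_clean.csv:

import sqlite3
import pandas as pd

conn = sqlite3.connect('out_Data.db')

# Stream the table back out in chunks so the full result never has to fit in RAM.
chunks = pd.read_sql_query("SELECT * FROM data", conn, chunksize=100000)

for i, chunk in enumerate(chunks):
    chunk.to_csv(
        'dataset_clean.csv',
        mode='w' if i == 0 else 'a',  # overwrite on the first chunk, append afterwards
        header=(i == 0),              # write the header only once
        index=False,
    )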

For better performance:

  • Increase the chunksize to the largest value your system can handle.
  • The sqlite3 CLI actually has a built-in way to import .csv files directly, which is much faster than going through Python (see the sketch below).
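
A minimal sketch of that second point, driving the sqlite3 shell from Python; it assumes the sqlite3 command-line tool is installed and on PATH, and reuses the file and table names from above:

import subprocess

# Feed dot-commands to the sqlite3 shell: switch to CSV mode, then bulk-load
# dataset.csv into the 'data' table. This is typically much faster than
# inserting rows one chunk at a time from Python.
subprocess.run(
    ["sqlite3", "out_Data.db"],
    input=".mode csv\n.import dataset.csv data\n",
    text=True,
    check=True,
)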
