Read large dataset Pandas
Question
I'm trying to read a 20 GB dataset. I've searched for a solution and tried:
import pandas as pd

data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)
df = pd.concat(data, ignore_index=True)
but I still get a memory error when concatenating. (I've changed the chunksize many times; same result.)
I have 16 GB of RAM running at 3000 MHz.
Any suggestions?
I am trying to import the data into a dataframe for data analysis and manipulation, then export it back. (The data needs to be cleaned of NaNs and noisy values.)
Recommended answer
Not knowing exactly what you want/need to accomplish with the data makes this tricky - but most data manipulation can be done in SQL, so I would suggest using sqlite3 as the data-processing engine.
sqlite3 stores data on disk, which sidesteps the impossibility of reading 20 GB of data into 16 GB of RAM.
Also, read the documentation for pandas.DataFrame.to_sql.
You will need something like this (not tested):
import sqlite3

import pandas as pd

conn = sqlite3.connect('out_Data.db')
data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)
for data_chunk in data:
    data_chunk.to_sql('data', conn, if_exists='append')

c = conn.cursor()
c.execute("SELECT * FROM data GROUP BY variable1")
<<<perform data manipulation using SQL>>>
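Since the question mentions cleaning NaNs, here is a small sketch of what that SQL manipulation could look like. NaN values written by to_sql become SQL NULLs, so noisy rows can be deleted in SQL without ever loading the full table into memory. The column names below are assumptions, and an in-memory database stands in for the real 20 GB one:

```python
import sqlite3

import pandas as pd

# Small stand-in for the real dataset (column names are assumptions)
df = pd.DataFrame({"variable1": ["a", None, "b"], "value": [1.0, 2.0, None]})

conn = sqlite3.connect(":memory:")
df.to_sql("data", conn, index=False, if_exists="append")

# NaN values become SQL NULLs, so incomplete rows can be dropped in SQL
conn.execute("DELETE FROM data WHERE variable1 IS NULL OR value IS NULL")
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(remaining)  # only the one complete row survives
```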
Bear in mind that you can't bring the data into a pandas dataframe unless the operations you perform dramatically reduce its memory footprint.
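For example, an aggregate computed in SQL is usually small enough to pull back into pandas with read_sql_query. A sketch, again using an in-memory database and assumed column names:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"variable1": ["a", "a", "b"], "value": [1, 2, 3]}).to_sql(
    "data", conn, index=False
)

# The aggregate is far smaller than the raw table, so it fits in RAM
summary = pd.read_sql_query(
    "SELECT variable1, AVG(value) AS avg_value FROM data GROUP BY variable1",
    conn,
)
print(summary)
```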
To convert back to .csv, see Write to CSV from sqlite3 database in python.
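One way to do that from pandas itself is to stream the cleaned table back out in chunks, so the full result never sits in RAM at once. A sketch (file names and columns are assumptions):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"variable1": ["a", "b"], "value": [1, 2]}).to_sql(
    "data", conn, index=False
)

# Stream the table back out chunk by chunk; only write the header once
with open("out.csv", "w", newline="") as f:
    for i, chunk in enumerate(
        pd.read_sql_query("SELECT * FROM data", conn, chunksize=1000)
    ):
        chunk.to_csv(f, header=(i == 0), index=False)
```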
For better performance:
- Increase the chunksize to the largest your system can handle
- The sqlite3 CLI actually has a built-in way to import .csv files, which is much faster than going through Python.
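That CLI import looks roughly like the following; `sample.csv` and `sample.db` are stand-in names (the tiny CSV is created here just so the sketch is self-contained). When the target table does not yet exist, `.import` uses the CSV header row as column names:

```shell
# Stand-in CSV; in practice you would point .import at the real 20 GB file
printf 'id,value\n1,10\n2,20\n' > sample.csv
sqlite3 sample.db ".mode csv" ".import sample.csv data"
sqlite3 sample.db "SELECT COUNT(*) FROM data;"
```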