Pickling a pandas dataframe multiplies the file size by 5


Problem description

I am reading an 800 MB CSV file with pandas.read_csv, and then saving it with the standard Python pickle.dump(dataframe). The result is a 4 GB .pkl file, so the CSV size is multiplied by 5.
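A minimal sketch of that workflow (the file names are placeholders, not from the original post):

import pandas as pd
import pickle

# Read the ~800 MB CSV into a DataFrame.
df = pd.read_csv('data.csv')

# Serialize it with the standard library pickle; this is the step
# that produces the unexpectedly large (~4 GB) .pkl file.
with open('data.pkl', 'wb') as f:
    pickle.dump(df, f)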

I expected pickle to compress the data rather than expand it. Also, running gzip on the CSV file compresses it to 200 MB, dividing it by 4.

I want to speed up the loading time of my program, and thought that pickling would help; but since disk access is the main bottleneck, I now understand that I would be better off compressing the files and then using the compression option of pandas.read_csv to speed up the loading time.
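For reference, pandas can read a gzipped CSV directly; a minimal sketch (the file name is a placeholder):

import pandas as pd

# pandas infers gzip compression from the .gz extension,
# or it can be forced explicitly with compression='gzip'.
df = pd.read_csv('data.csv.gz', compression='gzip')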

Is this correct?

Is it normal that pickling a pandas dataframe expands the data size?

How do you usually speed up loading time?

Thanks.

Answer

It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that, rather than loading the CSV file into RAM, as Kathirmani suggested. You will see the speedup in loading time you expect simply because you are not filling up 800 MB worth of RAM every time you load your script.

File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you have now added the extra step of decompressing the file, which doesn't solve your problem.

Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
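One way to do that import, sketched with the standard library (table name, file name, and columns are placeholders you would adapt to your data):

import csv
import sqlite3

conn = sqlite3.connect('your/database/path')
cur = conn.cursor()

# Create the target table; adjust the columns to match your CSV.
cur.execute("CREATE TABLE IF NOT EXISTS foo (bar TEXT, baz REAL)")

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    cur.executemany("INSERT INTO foo VALUES (?, ?)", reader)

conn.commit()
conn.close()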

You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:

import sqlite3

import pandas as pd

conn = sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"

# Load only the rows matching the query into a DataFrame.
# (pd.read_sql_query replaces the long-removed pandas.io.sql.read_frame.)
results_df = pd.read_sql_query(query, con=conn)
...



Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.
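A minimal sketch of that write-back step (the table name and example DataFrame are placeholders):

import sqlite3

import pandas as pd

conn = sqlite3.connect('your/database/path')

# Example DataFrame standing in for your processed results.
df = pd.DataFrame({'bar': ['FOOBAR'], 'baz': [1.0]})

# Persist it to a table so future runs can query it
# instead of re-reading the original CSV.
df.to_sql('foo_processed', conn, if_exists='replace', index=False)
conn.close()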

