Converting from CSV to binary format reduces the file size abnormally


Problem description

I have a CSV dataset of size 5.2 GB (taken from here). It has about 7M rows, each with 29 dimensions. The values are of type float64. I want to convert this dataset into a binary file. To do so, I run the following simple lines:

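import numpy as np
import pandas as pd

# Load the CSV into a DataFrame (the first line is treated as a header by default)
df = pd.read_csv('data.csv', sep=',')
# Write the values as raw bytes: no header, no shape or dtype metadata
np.asarray(df.values).tofile('data_binary.dat')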

A snapshot of the data:

0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02,9.119330644607543945e-01,-9.083136916160583496e-02,-2.335745543241500854e-01,-1.054220795631408691e+00,-9.759366512298583984e-01,-1.067278265953063965e+00,-6.138502955436706543e-01,7.542607188224792480e-01,-9.256605505943298340e-01,-5.289512276649475098e-01,1.235263347625732422e+00,8.606486320495605469e-01,-2.320102453231811523e-01,-4.043335020542144775e-01,-1.559396624565124512e+00,-8.154401183128356934e-01,-1.376865267753601074e+00,6.759096682071685791e-02,1.372575879096984863e+00,-5.736824870109558105e-01,-1.368692040443420410e+00,-4.793794453144073486e-01,1.529256343841552734e+00,-5.757816433906555176e-01,-1.290232419967651367e+00,4.999999694824218750e+02
1.000000000000000000e+00,3.272003531455993652e-01,-2.395536154508590698e-01,-1.592038273811340332e+00,-2.324983835220336914e+00,-5.070934891700744629e-01,1.574625492095947266e+00,-1.050106048583984375e+00,9.686639308929443359e-01,1.312386870384216309e+00,7.542607188224792480e-01,-9.113077521324157715e-01,-1.718587398529052734e+00,3.751282095909118652e-01,8.606486320495605469e-01,-3.711451292037963867e-01,-5.625200271606445312e-01,-2.721544206142425537e-01,-8.154401183128356934e-01,-3.339428007602691650e-01,1.058411240577697754e+00,4.364815354347229004e-01,-5.736824870109558105e-01,-2.172690257430076599e-02,-5.791836977005004883e-01,-3.260441124439239502e-01,-2.024624943733215332e-01,-4.585579931735992432e-01,7.500000000000000000e+02

The new binary file, data_binary.dat, is only 1.5 GB. This is a huge reduction, which made me wonder whether something went wrong with the way I convert the CSV to binary format. Is this reduction expected? At least this much? Thanks.

Recommended answer

OK, so I went and downloaded a sample of the data. Each row looks something like:

0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02 ...

Each individual number takes about 25 characters, or roughly 26 once you include the comma. At one byte per character, that is about 25 bytes per value. The binary representation of a 64-bit floating-point number needs 64 bits, i.e. 8 bytes per number. So you should expect the binary file to be less than 1/3 the size of the CSV, and the result looks correct:

5.2/3 = 1.73...
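The per-value character count can be checked directly on the sample. The snippet below is only a quick sketch: the tokens are copied from the first row shown in the question, and the `+ 1` assumes exactly one separator byte (a comma or a newline) follows each value.

```python
# Tokens copied from the sample row shown in the question.
tokens = [
    "0.000000000000000000e+00",
    "9.439358860254287720e-02",
    "-9.083136916160583496e-02",  # negative values carry an extra '-' character
]

# Bytes each value occupies in the CSV: its characters plus one separator
# (assumed to be a single comma or newline byte).
text_bytes = [len(t) + 1 for t in tokens]

# Bytes the same value occupies as a raw float64.
binary_bytes = 8

for token, n in zip(tokens, text_bytes):
    print(f"{token!r}: {n} bytes as text vs {binary_bytes} bytes as float64")
```

So a value costs 25 or 26 bytes as text, depending on its sign, against a fixed 8 bytes in binary.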

A better estimate would be about 26 characters per number (including commas and line breaks), so:

In [2]: (8/26)*5.2
Out[2]: 1.6

Seems legit.
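To see the same ratio empirically, here is a minimal sketch on a small synthetic array. It assumes the CSV was written in a `%.18e`-style format, which is what the sample rows look like (it is also the default `fmt` of `np.savetxt`). The function name `compare_sizes` and the file names are made up for illustration, and the random data only stands in for the real dataset.

```python
import os
import tempfile

import numpy as np


def compare_sizes(n_rows=1000, n_cols=29, seed=0):
    """Write one float64 array as %.18e CSV and as raw binary; return both sizes."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_rows, n_cols))  # float64 by default

    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "sample.csv")
        bin_path = os.path.join(tmp, "sample.dat")

        # Text form: same layout as the rows shown in the question.
        np.savetxt(csv_path, data, fmt="%.18e", delimiter=",")
        # Binary form: raw bytes only, 8 per value.
        data.tofile(bin_path)

        csv_size = os.path.getsize(csv_path)
        bin_size = os.path.getsize(bin_path)

        # tofile() stores no shape or dtype, so both must be supplied to read it back.
        restored = np.fromfile(bin_path, dtype=np.float64).reshape(n_rows, n_cols)

    return csv_size, bin_size, np.array_equal(restored, data)


csv_size, bin_size, round_trip_ok = compare_sizes()
print(f"csv: {csv_size} bytes, binary: {bin_size} bytes, "
      f"ratio: {bin_size / csv_size:.3f}, round trip ok: {round_trip_ok}")
```

The binary size is exactly rows × columns × 8 bytes, so for the real file you can compare `os.path.getsize('data_binary.dat')` against `df.shape[0] * df.shape[1] * 8`. With about 7M rows of 29 values, that comes to roughly 1.6 GB, in line with the estimate above.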
