Why is my TFRecord file so much bigger than csv?


Question

I always thought that, being a binary format, TFRecord would consume less space than a human-readable csv. But when I tried to compare them, I found that this is not the case.

For example, here I create a num_rows X 10 matrix with num_rows labels and save it as a csv, then save the same data as a TFRecord:

import pandas as pd
import tensorflow as tf
from random import randint

# 1M rows: 10 random feature columns in [0, 300] plus a binary label.
num_rows = 1000000
df = pd.DataFrame([[randint(0, 300) for r in xrange(10)] + [randint(0, 1)] for i in xrange(num_rows)])

# Plain csv version.
df.to_csv("data/test.csv", index=False, header=False)

# TFRecord version: one tf.train.Example per row.
writer = tf.python_io.TFRecordWriter('data/test.bin')
for _, row in df.iterrows():
    arr = list(row)
    features, label = arr[:-1], arr[-1]
    example = tf.train.Example(features=tf.train.Features(feature={
        'features': tf.train.Feature(int64_list=tf.train.Int64List(value=features)),
        'label':    tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    writer.write(example.SerializeToString())
writer.close()

Not only does creating the binary file take far more time than the csv (1 min 50 s vs 2 s), it also uses almost twice the space (67.7 MB vs 38 MB).

Am I doing this correctly? How can I make the output file smaller? (I saw TFRecordCompressionType, but is there anything else I can do?) And what is the reason for the much bigger size?
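For reference, the compression mentioned above is enabled through TFRecordOptions; a minimal sketch against the TF 1.x API used in the script above (file name reused from it):

import tensorflow as tf

# Write GZIP-compressed records; the reader has to be told about the
# compression too (TFRecordReader accepts the same options object, and
# tf.data.TFRecordDataset takes compression_type='GZIP').
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
writer = tf.python_io.TFRecordWriter('data/test.bin', options=options)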

Vijay's comment regarding int64 makes sense, but it still does not answer everything. An int64 consumes 8 bytes, so to make the comparison fair I can store integers whose string representation in the csv is also roughly 8 characters long. So if I do df = pd.DataFrame([[randint(1000000,99999999) for r in xrange(10)] for i in xrange(num_rows)]) the TFRecord is still slightly bigger: now it is 90.9 MB vs 89.1 MB. And that is despite the csv additionally spending 1 byte on each comma between the integers.
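A back-of-envelope check (my own arithmetic, not from the original post) suggests the 89.1 MB side is just what the csv should weigh:

# With randint(1000000, 99999999), 1 value in 11 has 7 digits and the
# rest have 8, so a row is ten numbers, nine commas and a newline:
num_rows = 1000000
avg_digits = (1 * 7 + 10 * 8) / 11.0      # ~7.91 characters per number
bytes_per_row = 10 * avg_digits + 9 + 1   # ~89.09 bytes per row
print(num_rows * bytes_per_row / 1e6)     # ~89.1 (MB) -- the csv figure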

Answer

The fact that your file is bigger is due to the overhead TFRecords incurs for each record, in particular the fact that the feature keys ('features', 'label') are stored in every single record.
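One way to see this overhead is to serialize a single Example and inspect the raw bytes: the key strings show up in every record. A small sketch, using the same tf.train.Example structure as the question:

import tensorflow as tf

# Serialize one record and look at what it contains.
example = tf.train.Example(features=tf.train.Features(feature={
    'features': tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 3])),
    'label':    tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
}))
raw = example.SerializeToString()
print(len(raw))            # total size of this one record, in bytes
print(b'features' in raw)  # True: the key string is embedded in the record
print(b'label' in raw)     # True: ...and so is this one, for every row written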

In your example, if you increase the number of features (from 10 to, say, 1000), you will observe that the tfrecord file is actually about half the size of the csv, because the fixed per-record overhead is amortized over more values.

Also, the fact that the integers are declared as 64-bit is ultimately irrelevant, because the serialization uses a "varint" encoding whose size depends on the value of the integer, not on its declared width. Take your example above and, instead of a random value between 0 and 300, use a constant value of 300: you will see that the file size increases.

Note that the number of bytes used by the encoding does not exactly track the magnitude of the integer: a value of 255 still needs two bytes, while a value of 127 takes only one. Interestingly, negative values come with a huge penalty: 10 bytes of storage, no matter what.
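To make the rule concrete, here is a hand-rolled sketch (my own helper, mirroring how protobuf sizes an int64 field):

def varint_size(value):
    # Varints carry 7 payload bits per byte; a negative int64 is encoded
    # as its 64-bit two's complement and always takes the maximum 10 bytes.
    if value < 0:
        return 10
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

print(varint_size(127))  # 1 -- fits in 7 bits
print(varint_size(255))  # 2 -- needs an 8th bit, hence a second byte
print(varint_size(300))  # 2 -- the constant-300 experiment above
print(varint_size(-1))   # 10 -- the negative-value penalty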

The correspondence between values and storage requirements can be found in protobuf's function _SignedVarintSize.
