Why is my TFRecord file so much bigger than csv?


Question

I always thought that, being a binary format, TFRecord would consume less space than a human-readable csv. But when I tried to compare them, I found that this is not the case.

For example, here I create a num_rows X 10 matrix with num_rows labels and save it as a csv, then save the same data as a TFRecord:

import pandas as pd
import tensorflow as tf
from random import randint

# 1M rows: 10 random feature columns in [0, 300] plus a binary label.
num_rows = 1000000
df = pd.DataFrame([[randint(0, 300) for r in xrange(10)] + [randint(0, 1)] for i in xrange(num_rows)])

# Plain csv version.
df.to_csv("data/test.csv", index=False, header=False)

# TFRecord version: one tf.train.Example per row.
writer = tf.python_io.TFRecordWriter('data/test.bin')
for _, row in df.iterrows():
    arr = list(row)
    features, label = arr[:-1], arr[-1]
    example = tf.train.Example(features=tf.train.Features(feature={
        'features': tf.train.Feature(int64_list=tf.train.Int64List(value=features)),
        'label':    tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    writer.write(example.SerializeToString())
writer.close()

Not only does creating the binary file take far more time than the csv (1 min 50 s vs 2 s), it also uses almost twice the space (67.7 MB vs 38 MB).

Am I doing this correctly? How can I make the output file smaller? (I saw TFRecordCompressionType, but is there anything else I can do?) And what is the reason for the much bigger size?
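For reference, the compression mentioned above is enabled through TFRecordOptions; a minimal sketch against the TF 1.x API used in the script above (file name reused from it):

import tensorflow as tf

# Write GZIP-compressed records; the reader has to be told about the
# compression too (TFRecordReader accepts the same options object, and
# tf.data.TFRecordDataset takes compression_type='GZIP').
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
writer = tf.python_io.TFRecordWriter('data/test.bin', options=options)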

Vijay's comment regarding int64 makes sense, but it still does not answer everything. An int64 consumes 8 bytes, so to make the comparison fair I can store integers whose string representation in the csv is also roughly 8 characters long. So if I do df = pd.DataFrame([[randint(1000000,99999999) for r in xrange(10)] for i in xrange(num_rows)]) the TFRecord is still slightly bigger: now it is 90.9 MB vs 89.1 MB. And that is despite the csv additionally spending 1 byte on each comma between the integers.
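A back-of-envelope check (my own arithmetic, not from the original post) suggests the 89.1 MB side is just what the csv should weigh:

# With randint(1000000, 99999999), 1 value in 11 has 7 digits and the
# rest have 8, so a row is ten numbers, nine commas and a newline:
num_rows = 1000000
avg_digits = (1 * 7 + 10 * 8) / 11.0      # ~7.91 characters per number
bytes_per_row = 10 * avg_digits + 9 + 1   # ~89.09 bytes per row
print(num_rows * bytes_per_row / 1e6)     # ~89.1 (MB) -- the csv figure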

Answer

The fact that your file is bigger is due to the overhead TFRecords incurs for each record, in particular the fact that the feature keys ('features', 'label') are stored in every single record.
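One way to see this overhead is to serialize a single Example and inspect the raw bytes: the key strings show up in every record. A small sketch, using the same tf.train.Example structure as the question:

import tensorflow as tf

# Serialize one record and look at what it contains.
example = tf.train.Example(features=tf.train.Features(feature={
    'features': tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 3])),
    'label':    tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
}))
raw = example.SerializeToString()
print(len(raw))            # total size of this one record, in bytes
print(b'features' in raw)  # True: the key string is embedded in the record
print(b'label' in raw)     # True: ...and so is this one, for every row written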

In your example, if you increase the number of features (from 10 to, say, 1000), you will observe that the tfrecord file is actually about half the size of the csv, because the fixed per-record overhead is amortized over more values.

Also, the fact that the integers are declared as 64-bit is ultimately irrelevant, because the serialization uses a "varint" encoding whose size depends on the value of the integer, not on its declared width. Take your example above and, instead of a random value between 0 and 300, use a constant value of 300: you will see that the file size increases.

Note that the number of bytes used by the encoding does not exactly track the magnitude of the integer: a value of 255 still needs two bytes, while a value of 127 takes only one. Interestingly, negative values come with a huge penalty: 10 bytes of storage, no matter what.
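To make the rule concrete, here is a hand-rolled sketch (my own helper, mirroring how protobuf sizes an int64 field):

def varint_size(value):
    # Varints carry 7 payload bits per byte; a negative int64 is encoded
    # as its 64-bit two's complement and always takes the maximum 10 bytes.
    if value < 0:
        return 10
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

print(varint_size(127))  # 1 -- fits in 7 bits
print(varint_size(255))  # 2 -- needs an 8th bit, hence a second byte
print(varint_size(300))  # 2 -- the constant-300 experiment above
print(varint_size(-1))   # 10 -- the negative-value penalty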

The correspondence between values and storage requirements can be found in protobuf's function _SignedVarintSize.
