Repartitioning parquet-mr generated parquets with pyarrow/parquet-cpp increases file size by x30?

Problem description

Using AWS Firehose, I am converting incoming records to parquet. In one example, I have 150k identical records enter firehose, and a single 30kb parquet gets written to s3. Because of how firehose partitions data, we have a secondary process (a lambda triggered by the s3 put event) that reads in the parquet and repartitions it based on the date within the event itself. After this repartitioning process, the 30kb file size jumps to 900kb.

Inspecting both parquet files-

  • The metadata doesn't change
  • The data doesn't change
  • They both use SNAPPY compression
  • The firehose parquet is created by parquet-mr, the pyarrow generated parquet is created by parquet-cpp
  • The pyarrow generated parquet has an additional pandas header (see the inspection sketch after this list)
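
For reference, here is a minimal sketch of how these properties can be checked with pyarrow (the file paths are placeholders, not the actual object keys):

import pyarrow.parquet as pq

# Hypothetical local copies of the two files being compared
for path in ['firehose.parquet', 'repartitioned.parquet']:
    meta = pq.ParquetFile(path).metadata
    # Compression is recorded per column chunk
    col = meta.row_group(0).column(0)
    print(path, 'rows:', meta.num_rows, 'compression:', col.compression)
    # Files written via pyarrow/pandas carry an extra b'pandas' entry
    # in the schema-level key-value metadata
    schema_meta = pq.read_schema(path).metadata or {}
    print('  pandas header present:', b'pandas' in schema_meta)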

The full repartitioning process-

import boto3
import pyarrow.parquet as pq

s3_client = boto3.client('s3')

# TMP_DIR, rand_string, firehose_bucket, key and local_partitioned_dir
# are defined elsewhere in the lambda
tmp_file = f'{TMP_DIR}/{rand_string()}'
s3_client.download_file(firehose_bucket, key, tmp_file)

pq_table = pq.read_table(tmp_file)

pq.write_to_dataset(
    pq_table,
    local_partitioned_dir,
    partition_cols=['year', 'month', 'day', 'hour'],
    use_deprecated_int96_timestamps=True
)

I imagine there would be some size change, but I was surprised to find such a big difference. Given the process I've described, what would cause the source parquet to go from 30kb to 900kb?

Recommended answer

Parquet uses different column encodings to store low entropy data very efficiently. For example:

  • It can use delta encoding to only store differences between values. For example 9192631770, 9192631773, 9192631795, 9192631797 would be stored effectively as 9192631770, +3, +22, +2.
  • It can use dictionary encoding to refer to common values compactly. For example, Los Angeles, Los Angeles, Los Angeles, San Francisco, San Francisco would be stored as a dictionary of 0 = Los Angeles, 1 = San Francisco and the references 0, 0, 0, 1, 1.
  • It can use run-length encoding to only store the number of repeating values. For example, Los Angeles, Los Angeles, Los Angeles would be effectively stored as Los Angeles×3. (Actually, as far as I know, pure RLE is only used for boolean types at this moment, but the idea is the same.)
  • A combination of the above, specifically RLE and dictionary encoding. For example, Los Angeles, Los Angeles, Los Angeles, San Francisco, San Francisco would be stored as a dictionary of 0 = Los Angeles, 1 = San Francisco and the references 0×3, 1×2.

With the 3 to 5 values of the examples above, the savings are not that significant, but the more values you have the bigger the gain. Since you have 150k identical records, the gains will be huge, since with RLE dictionary encoding, each column value will only have to be stored once, and then marked as repeating 150k times.
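
To illustrate the scale of this effect, here is a small, self-contained sketch (the file names and column values are made up for illustration) comparing a dictionary-encoded file against one written with dictionary encoding disabled:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# 150k identical records, mirroring the scenario in the question
table = pa.table({'city': ['Los Angeles'] * 150_000,
                  'year': [2018] * 150_000})

# Default settings: dictionary encoding on, each value stored once
pq.write_table(table, 'dict.parquet', compression='snappy')

# Disable dictionary encoding to force plain storage of every value
pq.write_table(table, 'plain.parquet', compression='snappy',
               use_dictionary=False)

print(os.path.getsize('dict.parquet'))   # small
print(os.path.getsize('plain.parquet'))  # considerably larger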

However, it seems that pyarrow does not use these space-saving encodings. You can confirm this by taking a look at the metadata of the two files using parquet-tools meta. Here is a sample output:

file schema: hive_schema 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id:          OPTIONAL INT32 R:0 D:1
name:        OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:61 TS:214 OFFSET:4 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id:           INT32 UNCOMPRESSED DO:0 FPO:4 SZ:107/107/1.00 VC:61 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: 1, max: 5, num_nulls: 0]
name:         BINARY UNCOMPRESSED DO:0 FPO:111 SZ:107/107/1.00 VC:61 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[min: Los Angeles, max: San Francisco, num_nulls: 0]

The encodings used are shown as ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY.
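
If parquet-tools is not at hand, the same information can be read with pyarrow itself. A minimal sketch (the path is a placeholder):

import pyarrow.parquet as pq

meta = pq.ParquetFile('example.parquet').metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # chunk.encodings is e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
        print(chunk.path_in_schema, chunk.encodings)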
