How to set Parquet file encoding in Spark


Question

The Parquet documentation describes a few different encodings here.

Does the encoding somehow change inside the file during read/write, or can I set it? There is nothing about it in the Spark documentation. I only found slides from a talk by Ryan Blue of the Netflix team. He sets Parquet configurations on the sqlContext:

sqlContext.setConf("parquet.filter.dictionary.enabled", "true")

It looks like this is not about plain dictionary encoding in Parquet files.

Answer

So I found the answer to my question on the Twitter engineering blog.

Parquet enables dictionary encoding automatically when the number of unique values is < 10^5. Here is a post announcing Parquet 1.0 with self-tuning dictionary encoding.
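The self-tuning behaviour above can be pictured with a short plain-Python sketch: the writer starts a column chunk with a dictionary and falls back to plain encoding once the dictionary grows past a threshold. This is an illustration only, not Parquet's actual writer code (the real writer also falls back when the dictionary's *byte* size exceeds the dictionary page size), and the names below are made up:

```python
# Illustrative sketch of Parquet's self-tuning dictionary fallback.
# Assumption: a simple distinct-value threshold; the real writer also
# checks the dictionary's byte size against the dictionary page size.

FALLBACK_THRESHOLD = 10**5  # distinct values, per the post above

def choose_encoding(values, threshold=FALLBACK_THRESHOLD):
    """Return the encoding a writer might pick for a column chunk."""
    dictionary = set()
    for v in values:
        dictionary.add(v)
        if len(dictionary) > threshold:
            return "PLAIN"            # too many distinct values: fall back
    return "PLAIN_DICTIONARY"         # small dictionary: keep it

print(choose_encoding(["a", "b", "a"] * 1000))  # few distinct values
print(choose_encoding(range(200_000)))          # every value unique
```

The first call stays dictionary-encoded (only two distinct values); the second exceeds the threshold and falls back to plain.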

UPD:

Dictionary encoding can be toggled in the SparkSession config:

SparkSession.builder
            .appName("name")
            .config("parquet.enable.dictionary", "false") // "true" (the default) enables it
            .getOrCreate()

Regarding encoding by column, there is an open improvement issue in Parquet's Jira, created on July 14, 2017. Since dictionary encoding is the default and applies only to the whole table, it turns off Delta Encoding (Jira issue for this bug), which is the only suitable encoding for data like timestamps, where almost every value is unique.
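To see why delta encoding suits timestamp-like data, here is a minimal plain-Python sketch (an illustration, not Parquet's actual DELTA_BINARY_PACKED implementation, which additionally bit-packs the deltas in blocks): storing successive differences turns large, nearly-unique values into small, highly compressible ones.

```python
# Minimal delta encode/decode sketch (illustrative only; Parquet's real
# DELTA_BINARY_PACKED also bit-packs the deltas in miniblocks).

def delta_encode(values):
    """First value verbatim, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by running sums over the deltas."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Timestamps one second apart: every value is unique, but the deltas are tiny.
ts = [1_500_000_000 + i for i in range(5)]
enc = delta_encode(ts)
print(enc)  # [1500000000, 1, 1, 1, 1]
```

A dictionary would need one entry per timestamp here, which is exactly the case where the automatic dictionary encoding gives no benefit.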

UPD2:

How can we tell which encoding was used for an output file?

  • I used parquet-tools for it.

-> brew install parquet-tools (for mac)
-> parquet-tools meta your_parquet_file.snappy.parquet

-> brew install parquet-tools (for mac)
-> parquet-tools meta your_parquet_file.snappy.parquet

Output:

.column_1: BINARY SNAPPY DO:0 FPO:16637 SZ:2912/8114/3.01 VC:26320 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
.column_2: BINARY SNAPPY DO:0 FPO:25526 SZ:119245/711487/1.32 VC:26900 ENC:PLAIN,RLE,BIT_PACKED
.

Where PLAIN and PLAIN_DICTIONARY are the encodings that were used for those columns.
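If you want to check encodings programmatically rather than by eye, a small parser over the column lines of the `parquet-tools meta` output can pull out the ENC field. This helper is a hypothetical sketch written for the line format shown above, not part of parquet-tools:

```python
# Hypothetical helper: extract the per-column encodings from one column
# line of `parquet-tools meta` output, in the format shown above.

def encodings_from_meta_line(line):
    """Return (column_name, [encodings]) from a parquet-tools meta column line."""
    name = line.split(":")[0].lstrip(".").strip()
    enc_field = next(f for f in line.split() if f.startswith("ENC:"))
    return name, enc_field[len("ENC:"):].split(",")

line = (".column_1: BINARY SNAPPY DO:0 FPO:16637 SZ:2912/8114/3.01 "
        "VC:26320 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED")
print(encodings_from_meta_line(line))
```

This makes it easy to assert in a test that a given column came out PLAIN_DICTIONARY (or did not).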

