Spark compression when writing to external Hive table

Problem description

I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting e.g.

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

I can switch between SNAPPY, GZIP and uncompressed. I can verify that the file size (and the filename ending) is influenced by these settings. I get a file named e.g.

part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet

However, if I work with a partitioned Hive table, this setting does not have any effect: the file size is always the same. In addition, the filename is always

part-00000

Now how can I change (or at least verify) the compression codec of the Parquet files in the partitioned case?

My table is:

CREATE EXTERNAL TABLE `test`(`const` string, `x` int)
PARTITIONED BY (`year` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
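
For context, the insert itself is a plain insertInto after setting the codec. A minimal sketch of that flow (assuming a DataFrame df whose columns match the table, with the partition column year last):

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

// insertInto resolves columns by position, so the partition column `year`
// must be the last column of the DataFrame. Dynamic partition inserts may
// also require hive.exec.dynamic.partition.mode=nonstrict.
df.select("const", "x", "year")
  .write
  .insertInto("test")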

Recommended answer

As you create an external table, I would proceed like this:

First, write your Parquet dataset with the required compression:

df.write
 .partitionBy("year")
 .option("compression","<gzip|snappy|none>")
 .parquet("<parquet_file_path>")
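
With this partitioned write, each partition directory gets its own files and the codec shows up in the file names, roughly like this (illustrative layout with made-up partition values):

<parquet_file_path>/year=2016/part-00000-<uuid>.gz.parquet
<parquet_file_path>/year=2017/part-00000-<uuid>.gz.parquet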

You can check the file extensions as before. Then you can create your external table as follows:

CREATE EXTERNAL TABLE `test`(`const` string, `x` int)
PARTITIONED BY (`year` int)
STORED AS PARQUET
LOCATION '<parquet_file_path>';

If the external table already exists in Hive, you just need to run the following to refresh your table:

MSCK REPAIR TABLE test;
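
If you want to verify the codec independently of the file name (e.g. for the plain part-00000 files produced by insertInto), one option is to read the Parquet footer directly. A minimal sketch using the parquet-hadoop API that ships with Spark (the file path is a placeholder, and the exact readFooter signature may vary between Parquet versions):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Read the footer of a single Parquet file; the compression codec is
// recorded per column chunk in the metadata, independent of the file name.
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("<path_to_one_parquet_file>"))

footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { col =>
    println(s"${col.getPath}: ${col.getCodec}")  // e.g. [x]: GZIP
  }
}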
