How can I insert into a hive table with parquet fileformat and SNAPPY compression?

Problem description

Hive 2.1

I have the following table definition:

CREATE EXTERNAL TABLE table_snappy (
a STRING,
b INT) 
PARTITIONED BY (c STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

Now, I would like to insert data into it:

INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);

However, when I look at the data file, all I see is a plain Parquet file without any compression. How can I enable Snappy compression in this case?

Goal: to have the Hive table data in Parquet format with SNAPPY compression.

I have also tried setting multiple properties:

SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET PARQUET_COMPRESSION_CODEC=snappy;

as well as

TBLPROPERTIES ('parquet.compression'='SNAPPY');

but nothing helps. I tried the same with GZIP compression and it does not seem to work either. I am starting to wonder whether this is possible at all. Any help is appreciated.

Solution

One of the best ways to check whether a file is compressed is to use parquet-tools.

create external table testparquet (id int, name string) 
  stored as parquet 
  location '/user/cloudera/testparquet/'
  tblproperties('parquet.compression'='SNAPPY');

insert into testparquet values(1,'Parquet');

Now when you look at the file, its name may not contain .snappy anywhere:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/testparquet
Found 1 items
-rwxr-xr-x   1 anonymous supergroup        323 2018-03-02 01:07 /user/cloudera/testparquet/000000_0

Let's inspect it further...

[cloudera@quickstart ~]$ hdfs dfs -get /user/cloudera/testparquet/*
[cloudera@quickstart ~]$ parquet-tools meta 000000_0 
creator:     parquet-mr version 1.5.0-cdh5.12.0 (build ${buildNumber}) 

file schema: hive_schema 
-------------------------------------------------------------------------------------------------------------------------------------------------------------
id:          OPTIONAL INT32 R:0 D:1
name:        OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:1 TS:99 
-------------------------------------------------------------------------------------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:45/43/0.96 VC:1 ENC:PLAIN,RLE,BIT_PACKED
name:         BINARY SNAPPY DO:0 FPO:49 SZ:58/56/0.97 VC:1 ENC:PLAIN,RLE,BIT_PACKED
[cloudera@quickstart ~]$ 

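The SNAPPY codec listed for each column chunk shows that the file is Snappy-compressed, even though its name carries no .snappy suffix.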
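
For tables with many columns, reading the codec off each line by eye gets tedious. The check above could be scripted roughly as follows. This is only a sketch: `list_parquet_codecs` is a hypothetical helper name, and it assumes the `parquet-tools meta` output has the same layout as shown above (column-chunk lines of the form `name: TYPE CODEC DO:...`); other parquet-tools versions may print a different layout.

```shell
# Hypothetical helper (not part of parquet-tools): reads the text output of
# `parquet-tools meta` on stdin and prints "<column> <codec>" for every
# column-chunk line. Assumes the layout shown in the listing above.
list_parquet_codecs() {
  awk '
    # Column-chunk lines look like: "id:  INT32 SNAPPY DO:0 FPO:4 SZ:..."
    # The 1st field ends with a colon, the 3rd field is the codec and the
    # 4th field starts with "DO:". Schema lines do not match this shape.
    $1 ~ /:$/ && $4 ~ /^DO:/ {
      name = $1
      sub(/:$/, "", name)   # drop the trailing colon from the column name
      print name, $3
    }
  '
}
```

With the file from the listing above, `parquet-tools meta 000000_0 | list_parquet_codecs` would be expected to print `id SNAPPY` and `name SNAPPY`, one per line. A column reported as UNCOMPRESSED would indicate that the compression property did not take effect for that file.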
