Vertica performance degradation while loading parquet files over delimited files from s3 to vertica


Problem description

I have parquet files for 2 billion records with GZIP compression, and the same data with SNAPPY compression. I also have delimited files for the same 2 billion records. We have 72 Vertica nodes in AWS prod, and we are seeing a huge performance hit for parquet files when moving data from S3 to Vertica with the COPY command, compared to delimited files. Parquet takes 7x more time than delimited files, even though the delimited file size is about 50x that of parquet.

Below are the stats for the test we conducted.

Total file sizes:

Parquet GZIP - 6 GB

Parquet Snappy - 9.2 GB

Delimited - 450 GB
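For a rough sense of scale: the only timing given in the question is the 7x ratio, so taking the delimited load as one time unit (an assumption, not a measured time), the implied per-byte ingest rates can be sketched in a throwaway query:

```sql
-- Relative ingest rates implied by the numbers above
-- (delimited load = 1 time unit, parquet load = 7; assumed, not measured).
SELECT 450.0 / 9.2         AS size_ratio,        -- ~48.9, the "50X" in the question
       450.0 / (9.2 / 7.0) AS per_byte_slowdown; -- ~342x fewer bytes parsed per unit time
```

In other words, the parquet parser is processing roughly two orders of magnitude fewer bytes per unit of wall-clock time than the delimited parser, which is why a much smaller file can still take far longer to load.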

Below are the copy commands used for both Parquet and Delimited. We did see about a 2-minute improvement when we removed NO COMMIT from the copy query.

Delimited file

COPY schema.table1 ( col1,col2,col3,col4,col5 ) FROM 's3://path_to_delimited_s3/*' DELIMITER E'\001' NULL AS '\N' NO ESCAPE ABORT ON ERROR DIRECT NO COMMIT;

Parquet file

COPY schema.table2 (col1,col2,col3,col4,col5 ) FROM 's3://path_to_parquet_s3/*' PARQUET ABORT ON ERROR DIRECT NO COMMIT;

We are surprised to see this spike with the parquet files. Is this expected for a parquet copy? Any pointers or thoughts would be really helpful.

Thanks

Answer

Without knowing anything more, it's difficult to answer. You should, again, monitor LOAD_STREAMS to find out what's going on.

One reason could be that the various files in s3://path_to_delimited_s3/* for the CSV version are optimally distributed across the nodes of your load process, considerably enhancing parallelism.

To count the number of parsing threads while your load is running, find your running row in LOAD_STREAMS (WHERE is_executing ...), and then join it with LOAD_SOURCES - USING (transaction_id, statement_id).
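A sketch of that monitoring query, using the LOAD_STREAMS and LOAD_SOURCES system tables named above (column availability may vary by Vertica version, so verify against your release):

```sql
-- While the COPY is executing, count how many source files are being
-- parsed in parallel for the running load stream; one LOAD_SOURCES row
-- per source suggests one parsing stream per file.
SELECT ls.table_name,
       COUNT(*) AS parsing_sources
FROM load_streams ls
JOIN load_sources USING (transaction_id, statement_id)
WHERE ls.is_executing
GROUP BY ls.table_name;
```

If the parquet load shows far fewer parallel sources than the delimited load across your 72 nodes, that would support the distribution explanation above.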
