Redshift COPY creates different compression encodings from ANALYZE

Question

I've noticed that AWS Redshift recommends different column compression encodings from the ones that it automatically creates when loading data (via COPY) to an empty table.

For example, I have created a table and loaded data from S3 as follows:

CREATE TABLE Client (
    Id varchar(511), ClientId integer, CreatedOn timestamp,
    UpdatedOn timestamp, DeletedOn timestamp, LockVersion integer,
    RegionId varchar(511), OfficeId varchar(511), CountryId varchar(511),
    FirstContactDate timestamp, DidExistPreCirts boolean, IsActive boolean,
    StatusReason integer, CreatedById varchar(511), IsLocked boolean,
    LockType integer, KeyWorker varchar(511), InactiveDate timestamp,
    Current_Flag varchar(511)
);

Table Client created Execution time: 0.3s

copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1
timeformat as 'YYYY-MM-DDTHH:MI:SS' gzip acceptinvchars
compupdate on region 'ap-southeast-2';

Warnings: Load into table 'client' completed, 24284 record(s) loaded successfully. Load into table 'client' completed, 6 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.

0 rows affected COPY executed successfully

Execution time: 3.39s
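
The warning above points at the stl_replacements system table. As a minimal sketch, the replaced values could be inspected like this, assuming the load was the most recent COPY in the current session (pg_last_copy_id() returns that COPY's query ID):

```sql
-- List the rows in which ACCEPTINVCHARS replaced invalid UTF-8 characters
-- during the most recent COPY run in this session.
select query, filename, line_number, colname, raw_line
from stl_replacements
where query = pg_last_copy_id()
order by line_number;
```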

Having done this I can look at the column compression encodings that have been applied by COPY:

select "column", type, encoding, distkey, sortkey, "notnull" 
from pg_table_def where tablename = 'client';

This gives (columns are column, type, encoding, distkey, sortkey, notnull):

╔══════════════════╦═════════════════════════════╦═══════╦═══════╦═══╦═══════╗
║ id               ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
║ clientid         ║ integer                     ║ delta ║ false ║ 0 ║ false ║
║ createdon        ║ timestamp without time zone ║ lzo   ║ false ║ 0 ║ false ║
║ updatedon        ║ timestamp without time zone ║ lzo   ║ false ║ 0 ║ false ║
║ deletedon        ║ timestamp without time zone ║ none  ║ false ║ 0 ║ false ║
║ lockversion      ║ integer                     ║ delta ║ false ║ 0 ║ false ║
║ regionid         ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
║ officeid         ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
║ countryid        ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
║ firstcontactdate ║ timestamp without time zone ║ lzo   ║ false ║ 0 ║ false ║
║ didexistprecirts ║ boolean                     ║ none  ║ false ║ 0 ║ false ║
║ isactive         ║ boolean                     ║ none  ║ false ║ 0 ║ false ║
║ statusreason     ║ integer                     ║ none  ║ false ║ 0 ║ false ║
║ createdbyid      ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
║ islocked         ║ boolean                     ║ none  ║ false ║ 0 ║ false ║
║ locktype         ║ integer                     ║ lzo   ║ false ║ 0 ║ false ║
║ keyworker        ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
║ inactivedate     ║ timestamp without time zone ║ lzo   ║ false ║ 0 ║ false ║
║ current_flag     ║ character varying(511)      ║ lzo   ║ false ║ 0 ║ false ║
╚══════════════════╩═════════════════════════════╩═══════╩═══════╩═══╩═══════╝

Then I can run:

analyze compression client;

This gives (columns are table, column, recommended encoding, estimated reduction in %):

╔════════╦══════════════════╦═══════╦═══════╗
║ client ║ id               ║ zstd  ║ 40.59 ║
║ client ║ clientid         ║ delta ║ 0.00  ║
║ client ║ createdon        ║ zstd  ║ 19.85 ║
║ client ║ updatedon        ║ zstd  ║ 12.59 ║
║ client ║ deletedon        ║ raw   ║ 0.00  ║
║ client ║ lockversion      ║ zstd  ║ 39.12 ║
║ client ║ regionid         ║ zstd  ║ 54.47 ║
║ client ║ officeid         ║ zstd  ║ 88.84 ║
║ client ║ countryid        ║ zstd  ║ 79.13 ║
║ client ║ firstcontactdate ║ zstd  ║ 22.31 ║
║ client ║ didexistprecirts ║ raw   ║ 0.00  ║
║ client ║ isactive         ║ raw   ║ 0.00  ║
║ client ║ statusreason     ║ raw   ║ 0.00  ║
║ client ║ createdbyid      ║ zstd  ║ 52.43 ║
║ client ║ islocked         ║ raw   ║ 0.00  ║
║ client ║ locktype         ║ zstd  ║ 63.01 ║
║ client ║ keyworker        ║ zstd  ║ 38.79 ║
║ client ║ inactivedate     ║ zstd  ║ 25.40 ║
║ client ║ current_flag     ║ zstd  ║ 90.51 ║
╚════════╩══════════════════╩═══════╩═══════╝

i.e. entirely different results.

I'm keen to know why this might be. I understand that ~24K records are fewer than the 100K that AWS specifies as being required for a meaningful compression analysis sample; however, it still seems strange that COPY and ANALYZE give different results for the same 24K-row table.

Recommended answer

COPY doesn't currently recommend ZSTD, which is why the recommended compression settings are different.

If you're looking to apply compression on permanent tables where you want to maximize compression (use the least amount of space), setting ZSTD across the board will give you close to optimal compression.
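
As a rough sketch of what "ZSTD across the board" could look like for the table in the question: declare the encoding on every column, then load with COMPUPDATE OFF so that COPY keeps the declared encodings instead of choosing its own. The table name client_zstd is hypothetical, and the S3 path and credentials are the placeholders from the question.

```sql
-- Hypothetical copy of the Client table with ZSTD declared on every column.
create table client_zstd (
    id               varchar(511) encode zstd,
    clientid         integer      encode zstd,
    createdon        timestamp    encode zstd,
    updatedon        timestamp    encode zstd,
    deletedon        timestamp    encode zstd,
    lockversion      integer      encode zstd,
    regionid         varchar(511) encode zstd,
    officeid         varchar(511) encode zstd,
    countryid        varchar(511) encode zstd,
    firstcontactdate timestamp    encode zstd,
    didexistprecirts boolean      encode zstd,
    isactive         boolean      encode zstd,
    statusreason     integer      encode zstd,
    createdbyid      varchar(511) encode zstd,
    islocked         boolean      encode zstd,
    locktype         integer      encode zstd,
    keyworker        varchar(511) encode zstd,
    inactivedate     timestamp    encode zstd,
    current_flag     varchar(511) encode zstd
);

-- compupdate off disables COPY's automatic compression analysis,
-- so the encodings declared above are left in place.
copy client_zstd from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1
timeformat as 'YYYY-MM-DDTHH:MI:SS' gzip acceptinvchars
compupdate off region 'ap-southeast-2';
```

Running the pg_table_def query from the question against client_zstd afterwards should show whether the declared encodings were kept.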

RAW comes back on some columns because, in this case, there is no advantage to applying compression: those columns occupy the same number of blocks with and without it. If you know the table will grow, it makes sense to apply compression to those columns as well.
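
One way to check the block-count argument is to count the 1 MB blocks each column occupies. The query below is a sketch built on the stv_blocklist and stv_tbl_perm system tables, which are normally visible only to superusers; column numbers start at 0, and numbers beyond the table's own columns belong to hidden system columns. The ALTER TABLE statement assumes a Redshift release that supports changing a column's encoding in place, and islocked is picked purely as an example.

```sql
-- Count the blocks used by each column of the client table.
select b.col, count(*) as blocks
from stv_blocklist b
join stv_tbl_perm p
  on b.tbl = p.id
 and b.slice = p.slice
where p.name = 'client'
group by b.col
order by b.col;

-- If the table is expected to grow, a RAW column can be switched to ZSTD.
alter table client alter column islocked encode zstd;
```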
