Amazon Redshift - Backup & Restore best practices?


Question


We have a set of tables in Redshift with columns that have the IDENTITY property for sequence generation. During the testing phase we need to take a backup and restore it, and this is a repetitive activity for each testing cycle. We followed the processes below to take a backup and then restore, and ran into the following issues:

  1. Traditional way: Created backup tables in another backup schema with CREATE TABLE XYZ_BKP AS SELECT * FROM XYZ. But doing that we lost the IDENTITY and other attributes of the table. So during restore, if you create the table directly from the backup, you lose those attribute properties, and you CAN'T ALTER the table afterwards to add the IDENTITY constraint.
  2. Traditional backup with a different restore method: This time we dropped and recreated the table from its DDL first, then tried an INSERT INTO from the backup. But Redshift can't insert values into IDENTITY columns.
  3. UNLOAD and COPY: We also tried Redshift utilities such as UNLOAD to back up the table to S3 and then restore with COPY. It worked, but then we faced other issues: a. DATE fields with leading zeros weren't extracted properly by UNLOAD. Ex: the date '0001-01-01' was extracted as '1-01-01'. COPY then fails, saying it is not a valid date. Several other errors are thrown during the restore (COPY), such as missing data for NOT NULL fields or invalid values for the int data type. In other words, UNLOAD and COPY don't work together in sync, and values change.
  4. Table restore from snapshot: I haven't tried this, but I understand AWS now supports table-level restore. Again, it's a tedious job to set this up individually for 500 tables. You also have to keep and track snapshots for a long time.


It will be very helpful if you could suggest the best possible way to back up and restore in my scenario, or the best practices organizations follow.

Answer


I would like to answer point by point, so this will be a bit long; please excuse me for that ;). In my opinion, the best option is UNLOAD to S3 and COPY back into the table from S3. (S3 could also be replaced with EC2 here.)

  1. Traditional way: we prefer this if we need to do some data alteration and want to dry-run our queries.
  2. Traditional backup with a different restore method: same issues as #1; we don't use it.
  3. UNLOAD and COPY: this is the most convenient method, and even IDENTITY values can be retained, hence it is always the preferred method.


Some problems are listed in the question, but most of them are false or can be avoided by supplying the proper export/import parameters. I would like to walk through all the necessary steps, with data, to show that there are no issues with dates and timestamps during load and unload.


Here I'm using most of the common data types to make the point.

create table sales(
salesid integer not null Identity,
commission decimal(8,2),
saledate date,
description varchar(255),
created_at timestamp default sysdate,
updated_at timestamp);


Content of the CSV (sales-example.txt):

salesid|commission|saledate|description|created_at|updated_at
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51


COPY command that imports the dates and timestamps as well as the IDENTITY values:

copy sales(salesid,commission,saledate,description,created_at,updated_at) from 's3://****/de***/sales-example.txt' credentials 'aws_access_key_id=************;aws_secret_access_key=***********' IGNOREHEADER  1 EXPLICIT_IDS;
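If a source file does have oddly formatted dates or empty fields (the issues from point 3 of the question), COPY accepts parameters that make the load more tolerant. A hedged sketch of such a variant (credentials elided; the right parameter choice depends on how the file was produced):

```sql
copy sales(salesid,commission,saledate,description,created_at,updated_at)
from 's3://****/de***/sales-example.txt'
credentials 'aws_access_key_id=****;aws_secret_access_key=****'
IGNOREHEADER 1
EXPLICIT_IDS        -- load the file's values into the IDENTITY column
DATEFORMAT 'auto'   -- let Redshift recognize variants such as '1-01-01'
TIMEFORMAT 'auto'
NULL AS ''          -- treat empty fields as NULL instead of "missing data"
ESCAPE;             -- pair with an UNLOAD that used ESCAPE
```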


This copies 5 records. I'm using PARALLEL OFF here to get the data into a single CSV to make the comparison easy; it isn't required and should normally be avoided.

unload ('select salesid,commission,saledate,description,created_at,updated_at from sales') to 's3://assortdw/development/sales-example-2.txt' credentials 'aws_access_key_id=***********;aws_secret_access_key=***********' parallel off;
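To make the extract itself round-trip safe, UNLOAD also accepts parameters that address the issues raised in the question. A hedged variant of the command above (credentials elided):

```sql
unload ('select salesid,commission,saledate,description,created_at,updated_at from sales')
to 's3://assortdw/development/sales-example-2.txt'
credentials 'aws_access_key_id=****;aws_secret_access_key=****'
parallel off
ESCAPE            -- backslash-escape delimiters and newlines inside varchar values
NULL AS ''        -- write NULLs as empty fields so COPY ... NULL AS '' reads them back
ALLOWOVERWRITE;   -- let repeated test cycles replace the previous extract
```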


And below is my content again, exactly the same as the import, meaning that if I run the COPY command in any other environment, say dev or QA or somewhere else, I will get exactly the same records as in the source Redshift cluster. (The rows come back in a different order, which is expected; only the values matter.)

5|3.55||Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
1|3.55|2018-12-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
7|3.50|2018-10-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
2|6.55|2018-01-01|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51
4|7.55|2018-02-10|Test description|2018-05-17 23:54:51|2018-05-17 23:54:51

  4. Table restore from snapshot: This requires our networking/infrastructure group, hence we avoid it, though I'm less sure about this option. Other experts are most welcome to comment/share details about this.


I hope this answers the question, as well as providing a starting point to discuss/summarize/conclude. All are most welcome to add their points.

