Redshift卸载的文件名 [英] Redshift unload's file name

查看:279
本文介绍了Redshift卸载的文件名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在运行Redshift卸载命令,但是没有得到我想要的名称.命令是:

I'm running a Redshift unload command, but am not getting the name I desire. The command is:

UNLOAD ('select * from foo')
                 TO 's3://mybucket/foo'
                 CREDENTIALS 'xxxxxx'
                 GZIP
                 NULL AS 'NULL'
                 DELIMITER as '\t'
                 allowoverwrite
                 parallel off

结果是mybucket/foo-000.gz.我不希望切片号在文件名的末尾(如果可以完全消除,那就太好了),我想添加一个文件扩展名在文件名的末尾.我希望看到以下任何一个:

The result is mybucket/foo-000.gz. I don't want the slice number to be the end of the file name (it'd be great if it can be eliminated completely), I want to add a file extension at end of the file name. I'd like to see either of the following:

  • mybucket/foo-000.txt.gz
  • mybucket/foo.txt.gz

有什么方法可以做到这一点(无需编写lambda后处理重命名器脚本)?

Is there any way to do this (without writing a lambda post process renamer script)?

推荐答案

TL; DR
不.

说明:
Amazon Redshift UNLOAD 文档中所述,如果您不这样做的话想要将其拆分为几个部分,可以使用PARALLEL FALSE,但是强烈建议将其保持启用状态.即使那样,该文件也将始终包含000.[EXT]后缀(仅当启用压缩后,当[EXT]存在时),因为Redshift可以输出的文件大小受到限制,如文档:

Explanation:
As it says in Amazon Redshift UNLOAD document, if you do not want it to be split into several parts, you can use PARALLEL FALSE, but it is strongly recommended to leave it enabled. Even then, the file will always include the 000.[EXT] suffix (when the [EXT] exists only when the compression is enabled), because there is a limit to a file size that Redshift can output, as says in the documentation:

默认情况下,UNLOAD将数据并行写入多个文件, 根据集群中的切片数量.默认选项 为ON或TRUE.如果PARALLEL为OFF或FALSE,则UNLOAD写入一个或 串行传输更多数据文件,完全根据ORDER BY排序 子句(如果使用了一个).数据文件的最大大小为6.2 GB. 因此,例如,如果您卸载13.4 GB的数据,则UNLOAD将创建 以下三个文件.

By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files.

s3://mybucket/key000    6.2 GB 
s3://mybucket/key001    6.2 GB
s3://mybucket/key002    1.0 GB

因此,它将始终至少添加前缀000,因为Redshift首先不知道要输出的文件大小,因此,他添加了此后缀以防输出到达大小为6.2 GB.

Therefore, it will alway add at least the prefix 000, because Redshift doesn't know what size of the file he is going to output in the first place, so he's adding this suffix in case the output will reach the size of 6.2 GB.

如果您问为什么不建议使用PARALLEL FALSE,我将尝试在以下几点中进行解释:

If you ask why the use of PARALLEL FALSE is not recommended, I'll try to explain it in several points:

  1. 最重要的原因是由于Redshift集群的设计方式.每个群集至少包括2个服务器,其中一个服务器是领导节点,其余服务器是数据节点.领导节点的目的是控制数据节点,它保存必要的信息以与Redshift中的所有数据(读或写)一起工作.
    当标记PARALLELTRUE时从Redshift卸载数据时,当X是您选择为其构建Redshift集群的节点数时,它将至少创建X个文件.这意味着,数据是直接从数据节点本身写入的,这要快得多,因为它是并行执行的,并且会跳过前导节点.
    当您决定关闭此标志时,所有数据都将从所有数据节点收集到一个节点(领导节点)中,因为它需要重新组织要输出的行的排序并在需要时将其压缩为单个溪流.此操作会使您的数据写入速度慢得多.
  2. 此外,由于所有内容(读取和写入查询)都通过领导者节点,因此,这会极大地降低Redshift集群的性能,并且如上所述,当领导者节点超载时,是性能问题.
    查询COPYUNLOAD直接与数据节点一起使用,因此,它们的行为几乎与使用PARALLEL TRUE的行为相同.相反,SELECTUPDATEDELETEINSERT之类的查询是由领导节点处理的,这就是为什么它们承受领导节点负载的原因.
  1. The most important reason is because of the way a Redshift cluster designed. Each cluster includes at least 2 servers, when one of them is a leader node and the rest are data nodes. The purpose of leader node, is to control the data nodes, it hold the necessary information to work with all data in Redshift, either read or write.
    When you unload data from Redshift while the flag PARALLEL is TRUE, it will create at least X files, when X is the number of nodes you choose to construct the Redshift cluster of, in the first place. It means, that the data is written directly from the data nodes themselves, which is much faster because it's doing it in parallel and skips the leader node.
    When you decide to turn this flag to off, all data is gathered from all of the data nodes into a single node, the leader node, because it needs to reorganize the sorting of the rows to output and also compress it if needed as a single stream. This action causes you data to be written much slower.
  2. Also, this is significantly decreases Redshift cluster performance in a matter of reading and writing data, because everything (read and write queries) goes through the leader node, and as it says above, when the leader node is overloaded, there will be a performance issue.
    The queries COPY and UNLOAD work directly with the data nodes, therefore, they behave almost the same way as if you would use PARALLEL TRUE. In the contrary, queries like SELECT, UPDATE, DELETE and INSERT, are processed by the leader node, that's why they suffer from the leader node loads.

这篇关于Redshift卸载的文件名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆