S3 -> Redshift cannot handle UTF8
Question
We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä, which is being translated into .. during the copy command and is now too long for the 20-character column.
I have verified that the data is correct in S3, but the COPY command does not understand the UTF-8 characters during import. Has anyone found a solution for this?
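For context, a minimal sketch of the kind of setup that hits this error; the table, bucket, and IAM role names below are placeholders, not taken from the question:

```sql
-- Hypothetical table: the target column is 20 bytes wide, not 20 characters.
CREATE TABLE users (
    id        INTEGER,
    city_name VARCHAR(20)
);

-- Hypothetical COPY from S3 (bucket, key, and IAM role are placeholders).
-- A row whose city_name has 20 characters including one 'Ä' needs 21 bytes,
-- so the load fails because the value no longer fits in the column.
COPY users
FROM 's3://example-bucket/users.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
CSV;
```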
Answer
tl;dr: the byte length for your varchar column just needs to be larger.
Multi-byte characters (UTF-8) are supported in the varchar data type; however, the length that is provided is in bytes, NOT characters.
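A quick way to see the character-versus-byte distinction from a Redshift session, assuming the LEN and OCTET_LENGTH string functions (LEN counts characters, OCTET_LENGTH counts bytes):

```sql
-- 'Ä' is a single character but occupies two bytes in UTF-8.
SELECT LEN('Ä')          AS character_count,  -- 1
       OCTET_LENGTH('Ä') AS byte_count;       -- 2
```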
The AWS documentation for Multibyte Character Load Errors states the following:
VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.
Therefore, if you want the character Ä to be allowed, then you need to allow 2 bytes for this character, instead of 1 byte.
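One way to make room is to widen the column's byte length. A sketch using the hypothetical table from above; recent versions of Redshift allow increasing a VARCHAR column's size in place with ALTER TABLE, or you can recreate the table with a wider definition:

```sql
-- Double the byte budget so values containing two-byte characters
-- such as 'Ä' still fit within the column.
ALTER TABLE users ALTER COLUMN city_name TYPE VARCHAR(40);
```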
The AWS documentation for VARCHAR or CHARACTER VARYING states the following:
... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8
Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.