S3 -> Redshift cannot handle UTF8
We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä, which is being translated into a two-byte sequence during the COPY command and is now too long for the 20 characters.
I have verified that the data is correct in S3, but the COPY command does not understand the UTF-8 characters during import. Has anyone found a solution for this?
tl;dr
The byte length for your varchar column just needs to be larger.
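A minimal sketch of the fix, assuming a hypothetical table my_table with the failing column name; Redshift allows increasing (though not decreasing) a VARCHAR length in place:

    -- Doubling to 40 bytes leaves room for 2-byte characters such as Ä;
    -- see the 4-byte worst case discussed below for a safer bound.
    ALTER TABLE my_table ALTER COLUMN name TYPE VARCHAR(40);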
Detail
Multi-byte characters (UTF-8) are supported in the varchar data type; however, the length that is provided is in bytes, NOT characters.
AWS documentation for Multibyte Character Load Errors states the following:
VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.
Therefore, if you want the character Ä to be allowed, you need to allow 2 bytes for this character instead of 1.
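You can check this distinction directly in Redshift: LEN counts characters while OCTET_LENGTH counts bytes, so for Ä the two disagree:

    -- LEN() is multibyte-aware and counts characters;
    -- OCTET_LENGTH() counts the underlying UTF-8 bytes.
    SELECT LEN('Ä') AS char_count,          -- 1
           OCTET_LENGTH('Ä') AS byte_count; -- 2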
AWS documentation for VARCHAR or CHARACTER VARYING states the following:
... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
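Applying that rule to the original VARCHAR(20): to guarantee room for 20 characters of any width, size the column for the four-byte worst case (the table and column names remain illustrative):

    -- 20 characters x 4 bytes/character = 80 bytes in the worst case
    CREATE TABLE my_table (
        name VARCHAR(80)  -- holds 20 characters even if every one is 4-byte UTF-8
    );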
For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8
Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.