S3 -> Redshift cannot handle UTF8


Problem description

We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä, which is being translated into .. during the copy command and is now too long for the 20 characters.

I have verified that the data in S3 is correct, but the COPY command does not understand the UTF-8 characters during the import. Has anyone found a solution for this?
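A minimal sketch of the kind of setup that hits this error (the table name, bucket path, and IAM role here are hypothetical, not taken from the question):

    -- Hypothetical table: the VARCHAR length is interpreted in bytes.
    CREATE TABLE customers (
        name VARCHAR(20)
    );

    -- Hypothetical COPY from S3; a row whose name contains Ä (2 bytes
    -- in UTF-8) can push the value past 20 bytes and fail the load.
    COPY customers
    FROM 's3://my-bucket/customers.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV;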

Solution


tl;dr

The byte length for your varchar column just needs to be larger.

Detail

Multi-byte characters (UTF-8) are supported in the varchar data type; however, the length that is provided is in bytes, NOT characters.

AWS documentation for Multibyte Character Load Errors states the following:

VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Therefore, if you want the character Ä to be allowed, you need to allow 2 bytes for this character instead of 1 byte.
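You can verify this from SQL itself. A quick check, assuming a Redshift session and using the built-in LEN and OCTET_LENGTH string functions:

    SELECT LEN('Ä')          AS char_count,   -- 1 character
           OCTET_LENGTH('Ä') AS byte_count;   -- 2 bytes in UTF-8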

AWS documentation for VARCHAR or CHARACTER VARYING states the following:

... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.

For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.
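To apply the fix from the tl;dr, one sketch, reusing the hypothetical customers table from above: sizing for the worst case of 4 bytes per character, 20 characters need VARCHAR(80), and Redshift allows increasing a VARCHAR column's size in place:

    -- 20 characters * up to 4 bytes each = 80 bytes
    ALTER TABLE customers ALTER COLUMN name TYPE VARCHAR(80);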
