S3 -> Redshift cannot handle UTF8


Problem Description

We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä, which is being translated into .. during the COPY and is now too long for the 20 characters.

I have verified that the data is correct in S3, but the COPY command does not understand the UTF-8 characters during import. Has anyone found a solution for this?
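
For context, here is a minimal sketch of the setup being described. The table, bucket, and IAM role names are hypothetical, not taken from the question:

    -- Hypothetical target table; city stands in for the VARCHAR(20) column that overflows.
    CREATE TABLE customers (
        city VARCHAR(20)
    );

    -- COPY from S3. UTF8 is already the default ENCODING for COPY,
    -- so the failure is not caused by the file encoding setting itself.
    COPY customers
    FROM 's3://example-bucket/customers.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    CSV;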

Recommended Answer

tl;dr

The byte length for your varchar column just needs to be larger.

Multi-byte characters (UTF-8) are supported in the varchar data type; however, the length that is provided is in bytes, NOT characters.

The AWS documentation for Multibyte Character Load Errors states the following:

VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Therefore, if you want the character Ä to be allowed, then you need to allow 2 bytes for this character, instead of 1 byte.
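
In practice, that means widening the column's byte length. A sketch, assuming the hypothetical customers table above and a recent Redshift version that can increase a VARCHAR column's size in place:

    -- Double the byte budget: VARCHAR(40) holds 20 two-byte characters
    -- such as Ä, and still holds 20 single-byte characters as before.
    ALTER TABLE customers
    ALTER COLUMN city TYPE VARCHAR(40);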

The AWS documentation for VARCHAR or CHARACTER VARYING states the following:

... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.

For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.
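
You can confirm the byte width directly in Redshift, since LEN counts characters while OCTET_LENGTH counts bytes:

    -- For U+00C4 this returns char_count = 1 and byte_count = 2,
    -- so twenty Ä's need VARCHAR(40), not VARCHAR(20).
    SELECT LEN('Ä')          AS char_count,
           OCTET_LENGTH('Ä') AS byte_count;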
