PostgreSQL COPY encoding, how to?

Question

I am importing a .txt file that contains IMDb information (such as movie name, movie ID, actors, directors, rating, votes, etc.) using the COPY statement. I am using Ubuntu 64-bit. The problem is that some actors have names containing non-ASCII characters, such as Jonas Åkerlund. That is why PostgreSQL throws an error:


ERROR: missing data for column "actors"
CONTEXT: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"
********** Error **********

ERROR: missing data for column "actors"
SQL state: 22P04
Context: COPY movies, line 3060: "tt0283003 Spun 2002 6.8 30801 101 mins. Jonas Ã"

My COPY statement looks like this:

COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt' (DELIMITER E'\t', FORMAT CSV, NULL '');

I do not know exactly how to use the collation statement. Could you help me, please? As always, thank you.

Recommended answer

Collation only determines how strings are sorted. The important thing when loading and saving them is the encoding.

By default, Postgres uses your client_encoding setting for COPY commands; if it doesn't match the encoding of the file, you'll run into problems like this.

You can see from the message that while trying to read the "Å", Postgres first read an "Ã", and then encountered some kind of error. The UTF8 byte sequence for "Å" is C3 85. C3 happens to be "Ã" in the LATIN1 codepage, while 85 is undefined*. So it's highly likely that the file is UTF8, but being read as if it were LATIN1.
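
The byte-level reasoning above can be reproduced outside the database. The following is a minimal Python sketch, not something from the original answer; it assumes Python 3.8 or later and uses Python's own "latin-1" codec, which maps every byte to the Unicode code point of the same value and may not match PostgreSQL's LATIN1 conversion in every detail.

```python
# Encode "Å" as UTF-8 and look at the raw bytes.
utf8_bytes = "Å".encode("utf-8")
print(utf8_bytes.hex(" "))  # c3 85

# Decode those same bytes as LATIN1, as a reader with the wrong
# encoding setting would do.
misread = utf8_bytes.decode("latin-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C3', 'U+0085']

# The same mismatch applied to the full name from the failing row.
name = "Jonas Åkerlund"
garbled = name.encode("utf-8").decode("latin-1")
print(repr(garbled))  # 'Jonas Ã\x85kerlund'
```

The "Ã" in the error message is the C3 byte read on its own, and the byte that follows it comes out as the control character U+0085.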

It should be as simple as specifying the appropriate encoding in the COPY command:

COPY movie FROM '/home/max/Schreibtisch/imdb_top100t.txt'
(DELIMITER E'\t', FORMAT CSV, NULL '', ENCODING 'UTF8');
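
Before settling on ENCODING 'UTF8', it can help to confirm that the file really is UTF-8. The sketch below is one possible check in Python, not part of the original answer; `looks_like_utf8` is a hypothetical helper name, and the sample byte strings stand in for the contents of the real file.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Stand-ins for file contents: the same name saved in two encodings.
utf8_sample = "Jonas Åkerlund".encode("utf-8")
latin1_sample = "Jonas Åkerlund".encode("latin-1")

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(latin1_sample))  # False
```

To test the actual file, pass in the bytes read with `open(path, "rb").read()`. A clean decode is a strong hint rather than proof: a pure-ASCII file passes too, and so does the rare file in another encoding whose bytes happen to form valid UTF-8.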






*I believe Postgres actually maps these "gaps" in LATIN1 to the corresponding Unicode code points. 85 becomes U+0085, a.k.a. "NEXT LINE", which explains why it was treated as a CSV row terminator.
