UTF-8 / Unicode Text Encoding with RPostgreSQL

Problem description

I'm running R on a Windows machine which is directly linked to a PostgreSQL database. I'm not using RODBC. My database is encoded in UTF-8 as confirmed by the following R command:

dbGetQuery(con, "SHOW CLIENT_ENCODING")
#   client_encoding
# 1            UTF8

However, when some text is read into R, it displays as strange text in R.

For example, the following text is shown in my PostgreSQL database: "Stéphane"

After exporting to R it's shown as: "StÃ©phane" (the é is encoded as Ã©)

When importing to R I use the dbConnect command to establish a connection and the dbGetQuery command to query data using SQL. I do not specify any text encoding anywhere when connecting to the database or when running a query.

I've searched online and can't find a direct resolution to my issue. I found this link, but their issue is with RODBC, which I'm not using.

This link is helpful in identifying the symbols, but I don't just want to do a find & replace in R... way too much data.

I tried running the following commands and got a warning.

Sys.setlocale("LC_ALL", "en_US.UTF-8")
# [1] ""
# Warning message:
# In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
#   OS reports request to set locale to "en_US.UTF-8" cannot be honored
Sys.setenv(LANG="en_US.UTF-8")
Sys.setenv(LC_CTYPE="UTF-8")

The warning occurs on the Sys.setlocale("LC_ALL", "en_US.UTF-8") command. My intuition is that this is a Windows-specific issue that doesn't occur on Mac/Linux/Unix.

Solution

After exporting to R it's shown as: "StÃ©phane" (the é is encoded as Ã©)

Your R environment is using a single-byte encoding such as latin-1 or windows-1252. The following test in Python shows that the UTF-8 bytes for é, decoded as if they were latin-1, produce the text you see:

>>> print("é".encode("utf-8").decode("latin-1"))
Ã©
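
The same round trip can be applied to the whole name from the question, and reversed. The following is a minimal sketch only: the helper names `misdecode` and `repair` are hypothetical, and the reversal assumes the garbled text came from exactly one wrong decoding step. Repairing strings after the fact is a fallback; fixing the connection or locale encoding avoids the problem at the source.

```python
def misdecode(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode those bytes with a single-byte encoding.

    This mimics a client that receives UTF-8 bytes but reads them as latin-1
    (or windows-1252, which Python calls "cp1252").
    """
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="latin-1"):
    """Undo misdecode: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = misdecode("Stéphane")
print(garbled)          # StÃ©phane
print(repair(garbled))  # Stéphane
```

For é, latin-1 and windows-1252 give the same garbled result, so the output alone does not tell you which of the two your R session is using.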

Either run SET client_encoding = 'windows-1252' on the connection, or fix the encoding your R environment uses. If R is running in a cmd.exe console, you will need to change the code page with the chcp console command; otherwise the fix is specific to whatever your R runtime is.
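
To see why changing client_encoding helps, here is a rough simulation in Python, not actual PostgreSQL or RPostgreSQL code. It assumes the server converts stored text into the bytes of the requested client encoding and that the client then reads those bytes using its own native encoding. The function name `fetch_as_seen_by_client` and its parameters are hypothetical.

```python
def fetch_as_seen_by_client(stored_text, client_encoding, native_encoding):
    """Simulate one text value travelling from the server to the client."""
    # Server side (assumed): convert the text to the requested client encoding.
    wire_bytes = stored_text.encode(client_encoding)
    # Client side (assumed): interpret the bytes in the session's native encoding.
    return wire_bytes.decode(native_encoding)


# client_encoding is UTF8, but the session reads bytes as windows-1252.
print(fetch_as_seen_by_client("Stéphane", "utf-8", "cp1252"))   # StÃ©phane

# client_encoding matches what the session expects.
print(fetch_as_seen_by_client("Stéphane", "cp1252", "cp1252"))  # Stéphane
```

One trade-off of this route: windows-1252 cannot represent every Unicode character. In this simulation, text outside its repertoire (for example Chinese characters) raises UnicodeEncodeError at the encode step, so this option suits data that stays within Western European characters.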
