How to write Unicode string to text file in R Windows?


Question

I've figured out how to write Unicode strings, but I'm still puzzled by why it works.

str <- "ỏ"
Encoding(str) # UTF-8
cat(str, file="no-iconv") # Written wrongly as <U+1ECF>
cat(iconv(str, to="UTF-8"), file="yes-iconv") # Written correctly as ỏ

I understand why the no-iconv approach does not work. It's because cat (and writeLines as well) convert the string into the native encoding first and then to the to= encoding. On Windows, this means R converts to Windows-1252 first, which cannot represent ỏ, resulting in <U+1ECF>.

What I don't understand is why the yes-iconv approach works. If I understand correctly, what iconv does here is simply to return a string with the UTF-8 encoding. But str is already in UTF-8! Why should iconv make any difference? In addition, when iconv(str, to="UTF-8") is passed to cat, shouldn't cat mess everything up once again by first converting to Windows-1252?

Recommended answer

I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magic and works just as well. I think that should avoid any unwanted character set conversions in cat().
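One way to check this claim is to read the written file back as raw bytes. The sketch below is illustrative only: the names `s_unknown`, `path_unknown` and `path_bytes` are made up here, and the second variant (`writeLines()` with `useBytes = TRUE`) is an alternative that is not part of the original answer. It assumes the connection is opened without an `encoding` argument, so no re-encoding happens on write.

```r
# The character from the question, written as an escape so the source
# file itself stays ASCII. R marks this string as UTF-8.
s <- "\u1ecf"

# Variant A (from the answer): drop the encoding mark on a copy, so that
# cat() has no reason to translate the bytes.
s_unknown <- s
Encoding(s_unknown) <- "unknown"
path_unknown <- tempfile()
cat(s_unknown, file = path_unknown)

# Variant B (an assumption, not from the answer): keep the UTF-8 mark but
# ask writeLines() to pass the bytes through untouched.
path_bytes <- tempfile()
con <- file(path_bytes, open = "wb")   # binary mode, no newline translation
writeLines(enc2utf8(s), con, sep = "", useBytes = TRUE)
close(con)

# Read both files back as raw bytes and compare with the UTF-8 encoding
# of U+1ECF, which is e1 bb 8f.
bytes_unknown <- readBin(path_unknown, what = "raw", n = file.size(path_unknown))
bytes_bytes   <- readBin(path_bytes,   what = "raw", n = file.size(path_bytes))
print(bytes_unknown)
print(bytes_bytes)
```

If both printed vectors are `e1 bb 8f`, the file contains valid UTF-8 for "ỏ" regardless of how the console displays the string.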

Here is an expanded example to demonstrate what I think happens in the original example:

print_info <- function(x) {
    print(x)
    print(Encoding(x))
    str(x)
    print(charToRaw(x))
}

cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")

cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")

cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")

cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")

In a "Latin-1" locale (see ?l10n_info) as used by R on Windows, output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f which is "ỏ").

In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.
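Which of the two cases applies on a given machine can be inspected with `l10n_info()`, the function the answer points to. The snippet below is a small sketch; the variable names `info` and `locale_kind` are only illustrative.

```r
# l10n_info() reports properties of the current locale as a list,
# including the logical components "UTF-8" and "Latin-1".
info <- l10n_info()
str(info)

# Classify the session so it is clear which set of output files
# is expected to be correct.
locale_kind <- if (isTRUE(info[["UTF-8"]])) {
    "UTF-8"
} else if (isTRUE(info[["Latin-1"]])) {
    "Latin-1"
} else {
    "other"
}
cat("Locale kind:", locale_kind, "\n")
```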

The output of the example code is as follows, using R 3.3.2 64-bit Windows version running on Wine:

(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
 chr "<U+1ECF>""| __truncated__
[1] e1 bb 8f

(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
 chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f

(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
 chr "á»"
[1] e1 bb 8f

(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
 chr "á»"
[1] e1 bb 8f

In the original example, iconv() uses the default from = "" argument which means conversion from the current locale, which is effectively "latin1". Because the encoding of str is actually "UTF-8", the byte representation of the string is distorted in step (2), but then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
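The distort-then-restore round trip can also be checked independently of the session locale by naming both encodings explicitly. This is a condensed sketch of steps (2) and (3) with a byte-level comparison added; the variable names are illustrative, and it assumes the platform's iconv maps byte 0x8f in "latin1" to U+008F, as the output above suggests.

```r
# Start from the UTF-8 bytes of U+1ECF, with no encoding mark.
original <- rawToChar(as.raw(c(0xe1, 0xbb, 0x8f)))

# Misread the three bytes as three latin1 characters and re-encode each
# one as UTF-8. This is the distortion seen in step (2).
distorted <- iconv(original, from = "latin1", to = "UTF-8")
print(charToRaw(distorted))

# Convert back to latin1, which is what cat() presumably does implicitly
# in a Latin-1 locale. The original bytes should reappear.
restored <- iconv(distorted, from = "UTF-8", to = "latin1")
print(charToRaw(restored))

# TRUE if the round trip reproduced the original bytes.
round_trip_ok <- identical(charToRaw(restored), charToRaw(original))
print(round_trip_ok)
```

The two conversions cancel out only because every byte value has a counterpart in latin1; with a native encoding that leaves some byte values undefined, the same trick could fail.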
