How to write Unicode string to text file in R Windows?
Question
I've figured out how to write Unicode strings, but I'm still puzzled by why it works.
str <- "ỏ"
Encoding(str) # UTF-8
cat(str, file="no-iconv") # Written wrongly as <U+1ECF>
cat(iconv(str, to="UTF-8"), file="yes-iconv") # Written correctly as ỏ
I understand why the no-iconv approach does not work. It's because cat() (and writeLines() as well) converts the string into the native encoding first and then to the to= encoding. On Windows, this means R first converts ỏ to Windows-1252, which cannot represent ỏ, resulting in <U+1ECF>.
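The failed conversion described above can be reproduced on any platform by calling iconv() with an explicit source and target encoding. This is a minimal sketch, not what cat() does internally: it assumes "latin1" as a stand-in for the Windows native encoding, and the variable names are illustrative only.

```r
# "\u1ecf" is the character from the question; the \u escape marks the string as UTF-8.
s <- "\u1ecf"
print(Encoding(s))
print(charToRaw(s))  # the three UTF-8 bytes of the character

# Assumption: "latin1" stands in for the Windows native encoding.
# The character has no Latin-1 representation, so the conversion should fail;
# with the default sub = NA, iconv() reports a failed conversion as NA.
converted <- iconv(s, from = "UTF-8", to = "latin1")
print(is.na(converted))
```

When the conversion happens implicitly during output instead, R substitutes an escape such as <U+1ECF> rather than NA, which is what ends up in the "no-iconv" file.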
What I don't understand is why the yes-iconv approach works. If I understand correctly, all iconv() does here is return a string with the UTF-8 encoding. But str is already in UTF-8! Why should iconv() make any difference? In addition, when iconv(str, to="UTF-8") is passed to cat(), shouldn't cat() mess everything up once again by first converting to Windows-1252?
Recommended answer
I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magic and works just as well. It should avoid any unwanted character set conversions in cat().
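A related way to keep R from re-encoding the string on output is to write its bytes through a binary connection with useBytes = TRUE. This is an alternative sketch rather than the approach described above; the temporary file and the variable names are assumptions made for illustration.

```r
# The character from the question, written with an escape so the source file stays ASCII.
s <- "\u1ecf"

# Hypothetical output path; any writable file name would do.
out <- tempfile(fileext = ".txt")

# Open in binary mode so no newline translation happens on Windows.
con <- file(out, open = "wb")
# enc2utf8() makes sure the bytes are UTF-8; useBytes = TRUE writes them unchanged.
writeLines(enc2utf8(s), con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes: the three UTF-8 bytes plus the trailing newline.
bytes <- readBin(out, what = "raw", n = file.size(out))
print(bytes)
```

Because the bytes are passed through untouched, the result should not depend on the native encoding of the session.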
Here is an expanded example to demonstrate what I think happens in the original example:
print_info <- function(x) {
  print(x)
  print(Encoding(x))
  str(x)
  print(charToRaw(x))
}
cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")
cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")
cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")
cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")
In a "Latin-1" locale (see ?l10n_info), as used by R on Windows, the output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f, which is "ỏ").
In a "UTF-8" locale, the files "no-iconv" and "unknown" should be correct.
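To see which of the two cases applies to a given session and to check the output files byte by byte, something like the following could be used. It is a sketch only: file_bytes() is a hypothetical helper, and the temporary file stands in for the files written by the example.

```r
# Report what R believes about the current locale.
loc <- l10n_info()
cat("UTF-8 locale:  ", loc[["UTF-8"]], "\n")
cat("Latin-1 locale:", loc[["Latin-1"]], "\n")

# Hypothetical helper: return the raw bytes stored in a file.
file_bytes <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

# The byte sequence a correctly written file should contain.
expected <- as.raw(c(0xe1, 0xbb, 0x8f))

# Demonstrate the helper on a temporary file that holds exactly those bytes.
demo <- tempfile()
writeBin(expected, demo)
print(identical(file_bytes(demo), expected))
```

After running the expanded example, calling file_bytes("yes-iconv") or file_bytes("unknown") and comparing the result with expected shows whether a file was written correctly.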
The output of the example code, using the 64-bit Windows version of R 3.3.2 running on Wine, is as follows:
(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
chr "<U+1ECF>""| __truncated__
[1] e1 bb 8f
(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f
(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
chr "á»"
[1] e1 bb 8f
(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
chr "á»"
[1] e1 bb 8f
In the original example, iconv() uses the default from = "" argument, which means conversion from the current locale, which is effectively "latin1". Because the encoding of str is actually "UTF-8", the byte representation of the string is distorted in step (2), but then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
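The distortion and its reversal can also be checked at the byte level without depending on the session's locale, by naming both encodings explicitly. This is a condensed sketch of steps (2) and (3), with variable names chosen for illustration; it does not show what cat() itself does.

```r
# The UTF-8 bytes of the character, held in a string with no declared encoding.
utf8_bytes <- as.raw(c(0xe1, 0xbb, 0x8f))
s <- rawToChar(utf8_bytes)

# Step (2): treat the bytes as latin1 and convert to UTF-8, so each byte becomes two.
mangled <- iconv(s, from = "latin1", to = "UTF-8")
print(charToRaw(mangled))  # c3 a1 c2 bb c2 8f

# Step (3): convert back to latin1, which restores the original three bytes.
restored <- iconv(mangled, from = "UTF-8", to = "latin1")
print(identical(charToRaw(restored), utf8_bytes))
```

The two conversions are exact inverses of each other, which is why the "yes-iconv" file ends up with the right bytes in a Latin-1 locale.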