R: can't read Unicode text files even when specifying the encoding

Views: 166 Published: 2020/10/29 6:09:54 Tags: windows r unicode encoding ucs2

Question

I'm using R 3.1.1 on 32-bit Windows 7. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)

The problem is that I can't seem to read the file even when specifying that encoding. (The characters belong to the standard Spanish Latin set, such as ñ, á and ó, and should be handled easily with CP1252 or anything similar.)

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
 [1] "ÿþE" ""    ""    ""    ""   ...
> readLines("filename.txt",encoding="UTF-8")
 [1] "\xff\xfeE" ""          ""          ""          ""    ...
> readLines("filename.txt",encoding="UCS2LE")
 [1] "ÿþE" ""    ""    ""    ""    ""    ""     ...
> readLines("filename.txt",encoding="UCS2")
 [1] "ÿþE" ""    ""    ""    ""    ...

Any ideas?

Thanks!

Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly.

Recommended answer

After reading the documentation more closely, I found the answer to my question.

The encoding argument of readLines only marks the input strings as being in a given encoding; it does not convert them. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also 'Details'.
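The difference between marking and re-encoding can be sketched in a few lines. This is only an illustration with a made-up string (the Latin-1 bytes for "niño"), not something taken from the original files: marking leaves the bytes untouched, whereas iconv actually rewrites them.

```r
# Hypothetical sample: the four Latin-1 bytes of "niño" (0xf1 is ñ).
latin1_bytes <- as.raw(c(0x6e, 0x69, 0xf1, 0x6f))
x <- rawToChar(latin1_bytes)

# Marking: only a declaration about the string, the bytes stay the same.
Encoding(x) <- "latin1"
marked_bytes <- charToRaw(x)

# Re-encoding: iconv converts the bytes to a different encoding.
y <- iconv(x, from = "latin1", to = "UTF-8")
reencoded_bytes <- charToRaw(y)

print(marked_bytes)     # still 6e 69 f1 6f
print(reencoded_bytes)  # ñ becomes the two bytes c3 b1
```

The encoding argument of readLines behaves like the marking step, which is why it cannot turn UCS-2 bytes into readable text.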

The proper way of reading a file with an uncommon encoding is, then,

filetext <- readLines(con <- file("UnicodeFile.txt", encoding = "UCS-2LE"))
close(con)
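A self-contained way to try the same approach without the original files is sketched below. The temporary file, the sample lines and the choice of "UTF-16LE" as the encoding name are assumptions made for illustration; the sample text is kept to ASCII so the result does not depend on the session locale, and no byte-order mark is written.

```r
# Write a small little-endian UTF-16 file (no BOM) to a temporary path.
path <- tempfile(fileext = ".txt")
sample_lines <- c("hola", "mundo")
payload <- paste0(paste(sample_lines, collapse = "\n"), "\n")
utf16_bytes <- iconv(payload, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(utf16_bytes, path)

# Declare the encoding on the connection so the input is re-encoded on read.
con <- file(path, encoding = "UTF-16LE")
filetext <- readLines(con)
close(con)

print(filetext)
unlink(path)
```

Keeping the connection in its own variable, rather than assigning it inside the readLines call, makes it easier to close the connection even when the read fails.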
