Reading an .RData file with a different encoding
Question
I have an .RData file to read on my Linux (UTF-8) machine, but I know the file is in Latin1 because I created it myself on Windows. Unfortunately, I no longer have access to the original files or to a Windows machine, so I need to read the file on my Linux machine.
To read an .RData file, the normal procedure is to run load("file.Rdata"). Functions such as read.csv have an encoding argument that you can use to solve this kind of issue, but load has no such argument. If I try load("file.Rdata", encoding = "latin1"), I just get this (expected) error:
Error in load("file.Rdata", encoding = "latin1") : unused argument (encoding = "latin1")
What else can I do? My files contain text variables with accented characters that get corrupted when opened in a UTF-8 environment.
Recommended answer
Thanks to 42's comment, I managed to write a function that fixes the encoding of the loaded data:
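fix.encoding <- function(df, originalEncoding = "latin1") {
  # Encoding<- only accepts character vectors, so other column types are skipped
  for (col in seq_len(ncol(df))) {
    if (is.character(df[, col])) Encoding(df[, col]) <- originalEncoding
    # Text stored as a factor keeps its strings in the levels
    if (is.factor(df[, col])) Encoding(levels(df[, col])) <- originalEncoding
  }
  return(df)
}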
The meat here is the command Encoding(df[, col]) <- "latin1", which takes column col of data frame df and declares its strings to be Latin1. Note that this only marks the encoding so that R interprets the existing bytes correctly; it does not rewrite the bytes (iconv or enc2utf8 perform an actual conversion). Unfortunately, Encoding only accepts character vectors, not whole data frames, so I had to create a function that sweeps over all columns of a data frame and applies the change to each one.
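To make the difference between declaring and converting an encoding concrete, here is a minimal sketch. It builds the Latin1 bytes of "café" by hand (an illustrative value, not taken from the original file) so that it does not depend on the session locale, and it uses only base R functions.

```r
# Latin1 bytes for "café": 0xe9 is "é" in Latin1 but is not valid UTF-8 on its own
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Declare the encoding: R now knows how to interpret the bytes,
# but the bytes themselves are left untouched
Encoding(x) <- "latin1"
bytes_after_marking <- charToRaw(x)

# Convert: enc2utf8() re-encodes the string, so "é" becomes the two bytes c3 a9
y <- enc2utf8(x)
bytes_after_converting <- charToRaw(y)

print(bytes_after_marking)
print(bytes_after_converting)
```

Marking is usually enough for the strings to display correctly; converting with enc2utf8 may be preferable if the data will be written back out from the UTF-8 machine.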
Of course, if the problem is limited to a couple of columns, you're better off applying Encoding to just those columns instead of the whole data frame (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, i.e. reading an R object created on Linux or Mac OS into Windows, you should use originalEncoding = "UTF-8".
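One possible way to restrict the fix to a set of columns is sketched below. The function name fix.encoding.cols and its cols argument are hypothetical, introduced only for illustration, and the toy data frame stands in for an object restored by load().

```r
# Hypothetical variant: 'cols' (names or indices) is an assumed extra argument
fix.encoding.cols <- function(df, cols = names(df), originalEncoding = "latin1") {
  for (col in cols) {
    # Only character columns can carry an encoding mark
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    }
  }
  df
}

# Toy data: the bytes of "René" in Latin1, plus an ASCII-only name
raw_name <- rawToChar(as.raw(c(0x52, 0x65, 0x6e, 0xe9)))
df <- data.frame(id = 1:2, name = c(raw_name, "Anna"), stringsAsFactors = FALSE)

fixed <- fix.encoding.cols(df, cols = "name")

# ASCII-only strings are never marked, so only the accented entry changes
print(Encoding(fixed$name))
```

Passing originalEncoding = "UTF-8" to the same function would cover the inverse case described above.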