Recoding a data.frame object from latin1 to UTF-8

Question

I work on Windows 7 (my system locale: "LC_COLLATE=French_France.1252") with data that contain accented characters.
My data are encoded in ANSI, which lets me view them correctly in the RStudio tabs.

My problem: when I create a googleVis page (which is encoded in UTF-8), the accented characters are not displayed correctly.

What I expected: I would like to convert my latin1 data frames to UTF-8 in R just before creating the googleVis pages. I have no idea how to do this. The stringi package seems to work only with raw data.

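# Example data frame with accented column names and values
fr <- data.frame(âge = c(15, 20), prénom = c("Adélia", "Adão"), row.names = c("I1", "I2"))

print(fr)

library(googleVis)

test <- gvisTable(fr)
plot(test)  # plot the gvis object, not the raw data frame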

Real data: https://drive.google.com/open?id=0B91cr4hfMXV4OEkzWk1aWlhvR0E

# importing (historical data)
test_ansi<-read.table("databig_ansi.csv",
                header=TRUE, sep=",",
                na.strings="",
                quote = "\"",
                dec=".") 

# subsetting 
library (dplyr)
test_ansi <- 
   test_ansi %>%
   count(ownera)

# encoding detection (stri_enc_detect comes from stringi)
library(stringi)

stri_enc_detect(test_ansi$ownera)

# visualisation
library (googleVis)

testvis <- gvisTable(test_ansi)
plot(testvis)

Recommended answer

Several packages provide functions for this, such as stringi, stringr, SoundexBR and tau, and base R has its own character conversion function, which can be used as:

text2 <- iconv(text, from = "latin1", to = "UTF-8")
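
As a minimal, self-contained sketch of what that call does, the example below builds a latin1 string from raw bytes (so it does not depend on how this page or a script file is encoded) and converts it. The variable names `x` and `x_utf8` are only illustrative.

```r
# "Adélia" as latin1 bytes: the accented é is the single byte 0xe9
x <- rawToChar(as.raw(c(0x41, 0x64, 0xe9, 0x6c, 0x69, 0x61)))
Encoding(x) <- "latin1"

# Convert to UTF-8 with base R
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(x_utf8)    # declared encoding of the result
charToRaw(x_utf8)   # é is now stored as the two bytes c3 a9
validUTF8(x_utf8)   # checks that the bytes form valid UTF-8
```

Inspecting the raw bytes is a locale-independent way to confirm that the conversion happened, since printing alone can be misleading when the console encoding differs.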

You may also want a more specific function that handles factors, like the following:

.fromto <- function(x, from, to)
{
    if (is.list(x)) {
        # Lists and data frames: convert each element, keeping the attributes
        xattr <- attributes(x)
        x <- lapply(x, .fromto, from, to)
        attributes(x) <- xattr
    } else {
        if (is.factor(x)) {
            # Factors: only the levels hold text
            levels(x) <- iconv(levels(x), from, to, sub = "byte")
        } else if (is.character(x)) {
            x <- iconv(x, from, to, sub = "byte")
        }
        # Also convert a "label" attribute, if present
        lb <- attr(x, "label")
        if (length(lb) > 0) {
            attr(x, "label") <- iconv(lb, from, to, sub = "byte")
        }
    }
    x
}

# Convert a vector, list or data frame from the given encoding
# (WINDOWS-1252 by default) to UTF-8
Latin2UTF8 <- function(x, from = "WINDOWS-1252")
{
    .fromto(x, from, "UTF-8")
}

You would then use it as:

Latin2UTF8(fr)
   âge prénom
I1  15 Adélia
I2  20   Adão
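
Note that the function above restores the original attributes of a list after converting its elements, so the column and row names themselves are left in their original encoding. If the names also contain accented characters (as `âge` and `prénom` do here), a small helper along these lines may be useful. `names_to_utf8` is a hypothetical name, not part of the answer, and this is only a sketch.

```r
# Hypothetical helper (an assumption, not from the answer): re-encode the
# column and row names of a data frame to UTF-8.
names_to_utf8 <- function(df, from = "WINDOWS-1252") {
    names(df) <- iconv(names(df), from, "UTF-8", sub = "byte")
    rownames(df) <- iconv(rownames(df), from, "UTF-8", sub = "byte")
    df
}

# Build the latin1 column name "prénom" from raw bytes (é = 0xe9)
nm <- rawToChar(as.raw(c(0x70, 0x72, 0xe9, 0x6e, 0x6f, 0x6d)))
Encoding(nm) <- "latin1"

df <- data.frame(a = c(15, 20), b = c("x", "y"), row.names = c("I1", "I2"))
names(df)[2] <- nm

df2 <- names_to_utf8(df, from = "latin1")
charToRaw(names(df2)[2])   # é should now be the two bytes c3 a9
```

It could be combined with the value conversion, for example `names_to_utf8(Latin2UTF8(fr))`.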

Additional edit after receiving further information and data

This is how my R is set up. My R runs in a UTF-8 locale, in English, by default. Whenever my system locale differs from the encoding of the file I am given, I pass fileEncoding = "LATIN1" when reading it. That is all.

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"


test_ansi <- read.table(file.choose(),
                        header = TRUE, sep = ",",
                        na.strings = "",
                        quote = "\"",
                        dec = ".", fileEncoding = "LATIN1")

> test_ansi2 <- 
+     test_ansi %>%
+     count(ownera)
> test_ansi2
Source: local data frame [6,482 x 2]

                ownera n
1       Abautret (Vve) 1
2              Abazuza 1
3            Abernathy 1
4  Abrahamsen, Heerman 1
5  Abrahamsen, Hereman 6
6   Abrahamsz, Heerman 2
7         Abram, Ralph 8
8      Abrams, Heerman 2
9            Abranches 1
10               Abreu 1
..                 ... .

# visualisation
library (googleVis)


testvis <- gvisTable(test_ansi)
plot(testvis)

Link to the created table
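
The fileEncoding approach described above can be tried without the real data. The sketch below writes a tiny latin1-encoded CSV to a temporary file and reads it back with the encoding declared; the file contents and the names `tmp` and `dat` are made up for illustration. It assumes the session runs in a UTF-8 locale, as in the setup shown above; in other locales the re-encoding target differs.

```r
# Write a small latin1 CSV byte by byte: header "ownera", one value "Adélia"
tmp <- tempfile(fileext = ".csv")
latin1_bytes <- c(charToRaw("ownera\nAd"), as.raw(0xe9), charToRaw("lia\n"))
writeBin(latin1_bytes, tmp)

# Declare the file's encoding on import; R re-encodes the text to the
# session's native encoding while reading
dat <- read.table(tmp, header = TRUE, sep = ",", fileEncoding = "latin1",
                  stringsAsFactors = FALSE)

print(dat)
validUTF8(as.character(dat$ownera))  # expected to hold in a UTF-8 locale
unlink(tmp)
```

Declaring the encoding at import time means no later conversion step is needed before passing the data frame to gvisTable.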
