R: Converting "special" letters into UTF-8?

Problem description

I run into problems matching tables when one data frame contains special characters and the other doesn't. Example: Doña Ana County vs. Dona Ana County.

Here is a script that reproduces the output:

library(tidyverse)
library(acs)
tbl_df(acs::fips.place)    # contains "Do\xf1a Ana County"
tbl_df(tigris::fips_codes) # contains "Dona Ana County"

Example:

tbl_df(tigris::fips_codes) %>% filter(county == "Dona Ana County")

Returns:

# A tibble: 1 x 5
  state state_code state_name county_code          county
  <chr>      <chr>      <chr>       <chr>           <chr>
1    NM         35 New Mexico         013 Dona Ana County

Unfortunately, the following queries return nothing:

tbl_df(acs::fips.place) %>% filter(COUNTY == "Do\xf1a Ana County")
tbl_df(acs::fips.place) %>% filter(COUNTY == "Doña Ana County")
tbl_df(acs::fips.place) %>% filter(COUNTY == "Dona Ana County")

# A tibble: 0 x 7
# ... with 7 variables: STATE <chr>, STATEFP <int>, PLACEFP <int>, PLACENAME <chr>, TYPE <chr>, FUNCSTAT <chr>, COUNTY <chr>

However, when I open the data frame in RStudio, the county is displayed as "Do\xf1a Ana County" (the original screenshot is not preserved).

Question 1: Why does the second query return nothing, even though "Do\xf1a Ana County" appears in the data?

Question 2: How can I convert all "special" characters such as ñ into n, or something similar (UTF-8?)? Is there a library or snippet for that, or a definition in the header, so that I don't have to define a rule for every character? I have to do this anyway in order to match certain columns from both tables.

Thanks!

Recommended answer

Use:

 tbl_df(acs::fips.place) %>% filter(COUNTY == "Do\\xf1a Ana County")

What your dataset really contains is Do\\xf1a, that is, a literal backslash followed by the characters xf1a rather than the single byte 0xF1. You can check this in the R console with, for instance:

f <- acs::fips.place
f[grep("Ana", f$COUNTY), ]
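
To make the difference concrete, here is a minimal sketch in base R that compares the escaped form with the real latin1 byte. The two strings are typed in by hand for illustration (they are not read from acs::fips.place), and the variable names are hypothetical.

```r
# Hand-written illustration values, not taken from the dataset
escaped  <- "Do\\xf1a Ana County"  # literal backslash, then the characters x, f, 1
raw_byte <- "Do\xf1a Ana County"   # one single byte 0xF1 (ñ in latin1)

# The escaped form is three bytes longer: "\", "x", "f", "1" replace one byte
nchar(escaped, type = "bytes")   # 18
nchar(raw_byte, type = "bytes")  # 15

# Characters 3 to 6 of the escaped form are plain ASCII
charToRaw(substr(escaped, 3, 6)) # 5c 78 66 31
```

Because the two strings differ byte for byte, a comparison with `==` against "Do\xf1a Ana County" or "Doña Ana County" cannot match a value stored in the escaped form.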

The functions to use are iconv(x, from = "", to = ""), or enc2utf8 and enc2native, which don't take a "from" argument. In most cases, building a package requires converting data to UTF-8 (I have to transcode all my French strings when building packages). Here I think the encoding is latin1, but the \ has been escaped.

# Note the single backslash: "\xf1" is the byte 0xF1, i.e. ñ in latin1
x <- "Do\xf1a Ana County"
Encoding(x) <- "latin1"
charToRaw(x)
#  [1] 44 6f f1 61 20 41 6e 61 20 43 6f 75 6e 74 79
xx <- iconv(x, "latin1", "UTF-8")
charToRaw(xx)
#  [1] 44 6f c3 b1 61 20 41 6e 61 20 43 6f 75 6e 74 79
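
If the column really holds the escaped form (a literal backslash followed by xf1), iconv alone has nothing to convert, because the string contains no 0xF1 byte. Below is a rough base R sketch of one way to undo the escaping first. The helper name `unescape_latin1` is hypothetical, and it assumes every `\xHH` sequence stands for a latin1 code point, which is a guess about how the data was produced.

```r
# Sketch: replace literal "\xHH" sequences with the matching UTF-8 character.
# Assumption: the escaped values are latin1 code points, which coincide with
# Unicode code points U+0000..U+00FF.
unescape_latin1 <- function(s) {
  vapply(s, function(one) {
    if (is.na(one)) return(NA_character_)
    # Find every backslash + "x" + two hex digits
    hex <- regmatches(one, gregexpr("\\\\x[0-9A-Fa-f]{2}", one))[[1]]
    for (h in unique(hex)) {
      code <- strtoi(substr(h, 3, 4), base = 16L)  # "f1" -> 241
      one <- gsub(h, intToUtf8(code), one, fixed = TRUE)
    }
    one
  }, character(1), USE.NAMES = FALSE)
}

fixed <- unescape_latin1("Do\\xf1a Ana County")
charToRaw(fixed)
```

Strings without any escape sequence pass through unchanged, so the helper can be applied to a whole column.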

Finally, if you need to clean up your output to get comparable strings, you can use this function (straight from my own encoding hell).

to.plain <- function(s) {
  # old1 <- iconv("èéêëù", "UTF8") # use this if your console is in LATIN1
  # new1 <- iconv("eeeeu", "UTF8") # use this if your console is in LATIN1
  old1 <- "èéêëù"
  new1 <- "eeeeu"
  # Replace each character of old1 with the one at the same position in new1
  chartr(old1, new1, s)
}
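
For the county names in the question, the same chartr idea can be extended to cover ñ. The sketch below is one possible variant: the helper name `fold_ascii` and the list of characters are illustrative and far from exhaustive. The characters are written as \u escapes so the code does not depend on the encoding of the script file. If installing a package is an option, stringi::stri_trans_general(x, "Latin-ASCII") covers far more characters.

```r
# Sketch: fold a handful of accented Latin letters to ASCII before comparing.
# The mapping is illustrative only; extend both strings in parallel as needed.
fold_ascii <- function(s) {
  #            à     á     â     è     é     ê     ë     ì     í     ñ
  accented <- paste0("\u00e0\u00e1\u00e2\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00f1",
                     "\u00f2\u00f3\u00f9\u00fa\u00fc")  # ò ó ù ú ü
  plain    <- "aaaeeeeiinoouuu"
  chartr(accented, plain, enc2utf8(s))
}

acs_name    <- "Do\u00f1a Ana County"  # hand-written example value
tigris_name <- "Dona Ana County"
fold_ascii(acs_name) == tigris_name    # TRUE
```

Applying the same folding to the key column of both tables before joining keeps the comparison symmetric, whichever table carries the accents.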
