Reading foreign characters


Problem description


I have a database containing the names of Premiership footballers which I am reading into R (3.0.2), but I run into difficulties with players whose names contain foreign characters (umlauts, accents, etc.). The code below illustrates this:

PlayerData <- read.table("C:\\Users\\Documents\\Players.csv", quote=NULL, dec=".", sep=",", stringsAsFactors=FALSE, header=TRUE, fill=TRUE, blank.lines.skip=TRUE)
Test <- PlayerData[c(33655:33656),] # names of the players here are "Cazorla" "Özil"

Test[Test$Player=="Cazorla",] # Outputs correct details
Test[Test$Player=="Ozil",] # Finds no data: '<0 rows> (or 0-length row.names)'

#Example of how the foreign character is treated:
substr("Özil",1,1)
[1] "Ã"
substr("Özil",1,2)
[1] "Ö"
substr("Özil",2,2)
[1] "
substr("Özil",2,3)
[1] "z

I have tried replacing the characters, as described in "R: Replacing foreign characters in a string", but since the accented characters in my example appear to be read as two separate characters, I do not think that approach works.

I would be grateful for any suggestions or workarounds.

The file is available for download here.

Solution

EDIT: It seems that the file you provided uses a different encoding than your system's native one.

An (experimental) encoding detection done by the stri_enc_detect function from the stringi package gives:

library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## 
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## 
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02
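
As a rough cross-check on a detection like this, you can test whether the bytes are valid UTF-8 at all. The sketch below is not part of the original answer: it uses base R's validUTF8 (available from R 3.3.0), two hand-written byte sequences standing in for the real file, and a hypothetical helper named looks_like_utf8.

```r
# Hypothetical byte sequences for the name "Özil" in two encodings
latin1_bytes <- as.raw(c(0xD6, 0x7A, 0x69, 0x6C))        # Ö is the single byte D6
utf8_bytes   <- as.raw(c(0xC3, 0x96, 0x7A, 0x69, 0x6C))  # Ö is the byte pair C3 96

# Hypothetical helper: TRUE if the bytes form a valid UTF-8 string
looks_like_utf8 <- function(bytes) validUTF8(rawToChar(bytes))

looks_like_utf8(latin1_bytes)  # FALSE
looks_like_utf8(utf8_bytes)    # TRUE
```

Bytes that fail this check cannot be UTF-8, which points to a single-byte encoding such as one of the ISO-8859 family.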

So the file is most likely in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file; it only needs to mark the strings with an encoding other than the default (native) one. You can load the file with:

PlayerData<-read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec = ".", sep=",", 
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
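
If you want to try this without the original file, here is a minimal self-contained sketch. The two-row temporary file is a made-up stand-in for PLAYERS.csv, and quote = "" is used here simply to disable quoting; both are assumptions for illustration.

```r
# Build a tiny Latin-1 file; the byte 0xD6 is "Ö" in ISO-8859-1
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Player,Year\nCazorla,2013\n"),
           as.raw(0xD6), charToRaw("zil,2013\n")), tmp)

# encoding = "latin1" marks the non-ASCII strings instead of re-encoding them
players <- read.table(tmp, sep = ",", header = TRUE, quote = "",
                      stringsAsFactors = FALSE, encoding = "latin1")

Encoding(players$Player)            # "unknown" "latin1"
players$Player[2] == "\u00D6zil"    # TRUE
unlink(tmp)
```

Pure ASCII strings such as "Cazorla" stay unmarked, which is why only the second element reports latin1.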

Now you may access individual characters correctly, e.g. with the stri_sub function:

Test<-PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"
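
For contrast, the split-character symptom from the question typically appears when the UTF-8 bytes of a string are interpreted as a single-byte encoding. This base-R sketch (not from the original answer, and only one plausible way the symptom can arise) reproduces it on purpose:

```r
name  <- "\u00D6zil"          # "Özil"; \u escapes are stored as UTF-8
bytes <- charToRaw(name)      # c3 96 7a 69 6c: Ö takes two bytes

# Deliberately mislabel the same bytes as Latin-1, one character per byte
misread <- rawToChar(bytes)
Encoding(misread) <- "latin1"
misread <- enc2utf8(misread)

nchar(name, type = "chars")     # 4
nchar(misread, type = "chars")  # 5
substr(name, 1, 1)              # "Ö" (U+00D6)
substr(misread, 1, 1)           # "Ã" (U+00C3), the first byte on its own
```

The second byte (0x96) has no printable Latin-1 glyph, which would explain the apparently empty output in the question.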

As for comparing strings, here is the result of an equality test with accented characters "flattened":

stri_cmp_eq("Özil", "Ozil", stri_opts_collator(strength=1))
## [1] TRUE

You may also get rid of accent characters by using iconv's transliterator (I am not sure whether it is available on Windows, though).

iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):

stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"
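
Putting this together, an accent-insensitive lookup like the one attempted in the question could be sketched as follows. The helper name find_player and the two-row data frame are hypothetical stand-ins; the code assumes the stringi package is installed and skips the lookup otherwise.

```r
# Hypothetical helper: match names after transliterating both sides to ASCII
find_player <- function(df, name) {
  key <- stringi::stri_trans_general(df$Player, "Latin-ASCII")
  df[key == stringi::stri_trans_general(name, "Latin-ASCII"), , drop = FALSE]
}

# Stand-in for the two rows of Test shown above
Test <- data.frame(Player = c("Cazorla", "\u00D6zil"),
                   Year = c(2013, 2013),
                   stringsAsFactors = FALSE)

if (requireNamespace("stringi", quietly = TRUE)) {
  print(find_player(Test, "Ozil"))        # the row for Özil
  print(find_player(Test, "\u00D6zil"))   # the same row
}
```

Transliterating both the column and the search term means the lookup works whether or not the caller types the umlaut.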
