Reading foreign characters


Problem description

I have a database containing the names of Premiership footballers which I am reading into R (3.02), but am encountering difficulties when it comes to players with foreign characters in their names (umlauts, accents etc.). The code below illustrates this:

PlayerData<-read.table("C:\\Users\\Documents\\Players.csv",quote=NULL, dec = ".", sep=",", stringsAsFactors=FALSE,header=TRUE,fill=TRUE,blank.lines.skip = TRUE)
Test<-PlayerData[c(33655:33656),] #names of the players here are "Cazorla" "Özil"

Test[Test$Player=="Cazorla",] #Outputs correct details
Test[Test$Player=="Ozil",] # Cannot find data: '<0 rows> (or 0-length row.names)'

#Example of how the foreign character is treated:
substr("Özil",1,1)
[1] "Ã"
substr("Özil",1,2)
[1] "Ö"
substr("Özil",2,2)
[1] "
substr("Özil",2,3)
[1] "z

I have tried replacing the characters, as described here: R: Replacing foreign characters in a string, but since the accented characters in my example appear to be read as two separate characters, I do not think that approach works.

I would be grateful for any suggestions or workarounds.

The file is available for download here.

Solution

EDIT: It seems that the file you provided uses a different encoding than your system's native one.
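
One way the symptoms in the question can arise is sketched below in base R: the single character Ö occupies two bytes in UTF-8, and if those bytes are labelled as latin1, R counts them as two characters. The variable names are illustrative, and the sketch assumes the mismatch is UTF-8 bytes treated as latin1; it is not taken from the asker's session.

```r
# "Özil" written with a Unicode escape, so R stores it as UTF-8
name_utf8 <- "\u00d6zil"
nchar(name_utf8)                  # 4 characters
name_bytes <- charToRaw(name_utf8)
length(name_bytes)                # 5 bytes: the O-umlaut takes two (c3 96)

# Label the same bytes as latin1: every byte now counts as one character
name_misread <- rawToChar(name_bytes)
Encoding(name_misread) <- "latin1"
nchar(name_misread)               # 5

# The first byte on its own reads as A-tilde, as in the question's output
first_char <- substr(name_misread, 1, 1)
first_char == "\u00c3"
```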

An (experimental) encoding detection done by the stri_enc_detect function from the stringi package gives:

library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## 
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## 
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02

So most likely the file is in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file: it can simply mark the strings with an encoding other than the default (native) one. You can load the file with:

PlayerData<-read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec = ".", sep=",", 
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
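
To try the same idea without the original file, here is a minimal, self-contained sketch that writes a two-row latin1 CSV to a temporary file and reads it back with `encoding='latin1'`. The file contents and the names `csv_path` and `players` are made up for illustration.

```r
# Build a tiny latin1-encoded CSV: in latin1, O-umlaut is the single byte 0xd6
csv_path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Player,Year\nCazorla,2013\n"),
           as.raw(0xd6), charToRaw("zil,2013\n")),
         csv_path)

# Declare the encoding of the input instead of re-encoding it
players <- read.table(csv_path, sep = ",", header = TRUE, quote = "",
                      stringsAsFactors = FALSE, encoding = "latin1")

Encoding(players$Player)                   # encoding mark of each name
nchar(players$Player)                      # counted per character, not per byte
players[players$Player == "\u00d6zil", ]   # look up the accented name

unlink(csv_path)
```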

Now you may access individual characters correctly, e.g. with the stri_sub function:

Test<-PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"

As for comparing strings, here is the result of an equality test with accented characters "flattened":

stri_cmp_eq("Özil", "Ozil", stri_opts_collator(strength=1))
## [1] TRUE
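
In recent stringi releases, as far as I can tell, the collator options are accepted by `stri_cmp_equiv` rather than `stri_cmp_eq`, so the call above may need adjusting depending on the installed version. The sketch below uses that function to filter names while ignoring accents; the `players` vector and the `"en"` locale are assumptions made for illustration.

```r
# Runs only if stringi is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  players <- c("Cazorla", "\u00d6zil")
  # strength = 1 compares base letters only, ignoring accents and case;
  # the "en" locale is an assumption (some locales treat O-umlaut as its own letter)
  collator <- stringi::stri_opts_collator(locale = "en", strength = 1)
  is_ozil <- stringi::stri_cmp_equiv(players, "Ozil", opts_collator = collator)
  players[is_ozil]
}
```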

You may also get rid of accent characters by using iconv's transliterator (I am not sure whether it is available on Windows, though).

iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):

stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"
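
If neither transliterator is available, a base-R fallback is to map a hand-picked set of accented letters with `chartr`. This is only a sketch: `fold_accents` is a hypothetical helper, and its mapping covers just the few letters listed, not the full range that `Latin-ASCII` handles.

```r
# Hypothetical helper: map a small, explicit set of accented letters to ASCII
fold_accents <- function(x) {
  # O/o, U/u, A/a with umlauts, then e with acute and grave accents
  accented <- "\u00d6\u00f6\u00dc\u00fc\u00c4\u00e4\u00e9\u00e8"
  plain    <- "OoUuAaee"
  chartr(accented, plain, enc2utf8(x))
}

players <- c("Cazorla", "\u00d6zil")
folded <- fold_accents(players)
folded == c("Cazorla", "Ozil")
```

Letters missing from the mapping pass through unchanged, so the table has to be extended for each data set.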
