Reading foreign characters

Views: 164 Published: 2017/8/17 0:34:19 Tags: string r encoding character-encoding string-comparison


Problem description

I have a database containing the names of Premiership footballers which I am reading into R (3.02), but I run into difficulties with players whose names contain foreign characters (umlauts, accents, etc.). The code below illustrates this:

PlayerData <- read.table("C:\\Users\\Documents\\Players.csv", quote = NULL, dec = ".", sep = ",", stringsAsFactors = FALSE, header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
Test <- PlayerData[c(33655:33656), ] # names of the players here are "Cazorla" "Özil"

Test[Test$Player == "Cazorla", ] # Outputs correct details
Test[Test$Player == "Ozil", ] # Cannot find data: '<0 rows> (or 0-length row.names)'

#Example of how the foreign character is treated:
substr("Özil",1,1)
[1] "Ã"
substr("Özil",1,2)
[1] "Ö"
substr("Özil",2,2)
[1] "
substr("Özil",2,3)
[1] "z

I have tried replacing the characters, as described in "R: Replacing foreign characters in a string", but since the accented characters in my example appear to be read as two separate characters, I do not think that approach works.

I would be grateful for any suggestions or workarounds.

The file is available for download here.

Solution

EDIT: It seems that the file you provided uses a different encoding than your system's native one.
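
To make the symptom in the question concrete, here is a minimal base R sketch of what an encoding mismatch does to a single accented letter. It assumes the two-character result in the question comes from UTF-8 bytes being interpreted as latin1; the variable names are illustrative and not part of the original code.

```r
# "Özil" written with a Unicode escape so the example does not depend on
# the encoding of this script file.
x <- "\u00d6zil"

# In UTF-8 the letter Ö occupies two bytes (c3 96), so the name is 5 bytes long.
utf8_bytes <- charToRaw(enc2utf8(x))
print(utf8_bytes)

# Interpret those same bytes as latin1, where every byte is one character.
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

nchar(x, type = "chars")        # 4 characters when decoded correctly
nchar(misread, type = "chars")  # 5 characters when decoded as latin1

# The first "character" of the misread string is U+00C3 (Ã), matching the
# output of substr() shown in the question.
first_char <- substr(misread, 1, 1)
utf8ToInt(first_char)           # 195, i.e. 0xC3
```

The same byte sequence yields a different number of characters depending on the encoding R assumes, which is why substr() and == behave unexpectedly until the encoding is declared correctly.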

Running the (experimental) encoding detection provided by the stri_enc_detect function from the stringi package gives:

library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## 
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## 
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02

So the file is most likely encoded in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file; it can simply mark the strings with an encoding other than the default (native) one. You can load the file with:

PlayerData<-read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec = ".", sep=",", 
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
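
Since the original file is not reproduced here, the following sketch recreates the situation with a tiny latin1 file written to a temporary path. The file contents and the names tmp and players are made up for illustration; only the encoding = "latin1" argument comes from the answer above.

```r
# Write a small CSV whose second player name starts with the latin1 byte
# 0xD6 (the letter Ö in ISO-8859-1).
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("Player,Year\nCazorla,2013\n"),
           as.raw(0xd6),
           charToRaw("zil,2013\n")), con)
close(con)

# Declare the encoding while reading: R keeps the bytes and marks the strings.
players <- read.table(tmp, sep = ",", header = TRUE, quote = "",
                      stringsAsFactors = FALSE, encoding = "latin1")

Encoding(players$Player)        # shows how each string is marked

# Comparison against a Unicode literal now succeeds, because R translates
# strings with different declared encodings before comparing them.
players[players$Player == "\u00d6zil", ]

unlink(tmp)
```

Note that encoding only marks the strings that were read. If the goal were to convert the data to the native encoding while reading, fileEncoding would be the argument to look at instead.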

Now you may access individual characters correctly, e.g. with the stri_sub function:

Test<-PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"

As for comparing strings, here is the result of an equality test with accented characters "flattened":

stri_cmp_eq("Özil", "Ozil", stri_opts_collator(strength=1))
## [1] TRUE
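
Applied to the original problem, the same idea can be used to subset a data frame. The sketch below is written under two assumptions: that stringi is installed, and that the installed release exposes collator-based equality as stri_cmp_equiv, which is where recent stringi versions accept collator options. The helper match_ignoring_accents and the small players data frame are hypothetical.

```r
# Stand-in for the two rows shown above.
players <- data.frame(Player = c("Cazorla", "\u00d6zil"),
                      Year = c(2013, 2013),
                      stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where x equals target at collation strength 1
# (primary level), which ignores accents and also letter case.
# The locale is fixed to "en" because some locales (e.g. Swedish) treat
# Ö as a distinct base letter.
match_ignoring_accents <- function(x, target) {
  stringi::stri_cmp_equiv(
    x, target,
    opts_collator = stringi::stri_opts_collator(locale = "en", strength = 1)
  )
}

if (requireNamespace("stringi", quietly = TRUE)) {
  print(players[match_ignoring_accents(players$Player, "Ozil"), ])
}
```

Because strength 1 also ignores case, "ozil" would match as well; a higher strength is needed if case should be significant.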

You may also get rid of accent characters by using iconv's transliterator (I am not sure whether it is available on Windows, though).

iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):

stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"
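
If neither transliterator is available, a hand-written mapping with base R's chartr is a possible fallback. This is only a sketch: the helper flatten_accents is hypothetical, and its mapping covers just the handful of letters listed, so it would need extending for real data.

```r
# Accented letters and their ASCII replacements, position by position.
# Unicode escapes keep the script independent of its own file encoding.
accented <- "\u00d6\u00f6\u00dc\u00fc\u00c4\u00e4\u00c9\u00e9"  # Ö ö Ü ü Ä ä É é
plain    <- "OoUuAaEe"

# Hypothetical helper: convert to UTF-8 first, then translate character
# by character.
flatten_accents <- function(x) chartr(accented, plain, enc2utf8(x))

players <- data.frame(Player = c("Cazorla", "\u00d6zil"),
                      Year = c(2013, 2013),
                      stringsAsFactors = FALSE)

flatten_accents(players$Player)

# The lookup that failed in the question now finds the row.
players[flatten_accents(players$Player) == "Ozil", ]
```

Unlike the transliterators above, chartr can only replace one character with exactly one character, so letters such as ß that expand to two ASCII letters need separate handling (for example with gsub).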
