Force character vector encoding from "unknown" to "UTF-8" in R


Problem description


I have a problem with inconsistent encoding of a character vector in R.

The text file from which I read the table is encoded (according to Notepad++) in UTF-8 (I also tried UTF-8 without BOM).

I want to read the table from this text file, convert it to a data.table, set a key and make use of binary search. When I tried to do so, the following warning appeared:

Warning message: In [.data.table(poli.dt, "żżonymi", mult = "first") : A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and others not. But if either latin1 or UTF-8 is used exclusively, and all unknown encodings are ascii, then the result should be ok. In future we will check for you and avoid this warning if everything is ok. The tricky part is doing this without impacting performance for ascii-only cases.

and binary search does not work.
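
For reference, a minimal sketch of the workflow that triggers this (the file name and separator are placeholders; the key column and lookup value are the ones shown in the warning above):

library(data.table)

# read the table (placeholder file name), set the key column and try a binary-search subset
poli.dt <- fread("poli.txt", sep = "\t", header = TRUE)
setkey(poli.dt, word)
poli.dt["żżonymi", mult = "first"]   # this call emits the mixed-encoding warning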

I realised that my data.table key column contains both "unknown" and "UTF-8" Encoding types:

> table(Encoding(poli.dt$word))
unknown   UTF-8 
2061312 2739122 

I tried to convert this column (before creating a data.table object) with the use of:

  • Encoding(word) <- "UTF-8"
  • word <- enc2utf8(word)

but with no effect.
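
As background (this sketch is mine, not part of the original question): Encoding()<- only re-labels the bytes already stored in a string, while iconv() actually converts them, so re-marking CP-1250 bytes as "UTF-8" yields invalid UTF-8. The byte value below is illustrative:

# a single CP-1250 byte for the Polish letter "ż" (0xBF), as it might arrive when a file
# is read without declaring its encoding under a Polish_Poland.1250 locale
w <- rawToChar(as.raw(0xBF))

w_marked <- w
Encoding(w_marked) <- "UTF-8"   # only re-labels: the byte itself is not valid UTF-8
validUTF8(w_marked)             # FALSE (validUTF8() needs R >= 3.3.0)

w_conv <- iconv(w, from = "CP1250", to = "UTF-8")   # actually converts the bytes
validUTF8(w_conv)               # TRUE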

I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. encoding = "UTF-8"):

  • data.table::fread
  • utils::read.table
  • base::scan
  • colbycol::cbc.read.table

but with no effect.
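
For completeness, a hedged sketch of how the encoding-related arguments of those readers are usually passed (the file name and separator are placeholders, and fread's encoding argument only exists in newer data.table releases):

# utils::read.table: fileEncoding re-encodes the input, encoding only marks the strings
d1 <- read.table("poli.txt", header = TRUE, sep = "\t",
                 fileEncoding = "UTF-8", stringsAsFactors = FALSE)

# base::scan: fileEncoding behaves as in read.table
v <- scan("poli.txt", what = character(), sep = "\t", fileEncoding = "UTF-8")

# data.table::fread: recent versions accept encoding = "UTF-8" to mark the strings read
library(data.table)
d2 <- fread("poli.txt", sep = "\t", encoding = "UTF-8")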

==================================================

My R.version:

> R.version
           _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          0.3                         
year           2014                        
month          03                          
day            06                          
svn rev        65126                       
language       R                           
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy  

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250                LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   

Solution

The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it is in ASCII. To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)
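
As a small usage note, tabulating the marks shows how the mixture splits; stri_enc_mark() reports one of "ASCII", "latin1", "native", "UTF-8" or "bytes" per string:

table(stri_enc_mark(poli.dt$word))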

To check whether each string consists of valid UTF-8 byte sequences, call:

all(stri_enc_isutf8(poli.dt$word))

If it is not the case, then your file is not in UTF-8 at all.

I suspect that you haven't instructed R while reading the file that it is indeed in UTF-8 (it should suffice to look at the contents of poli.dt$word to verify this statement). If my guess is true, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings

If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
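
A short follow-up sketch (assuming one of the fixes above was applied) to confirm that the key column is now consistently encoded before retrying the binary search; the lookup value is the one from the original warning:

table(Encoding(poli.dt$word))        # expect only "UTF-8" (pure-ASCII strings stay "unknown")
all(stri_enc_isutf8(poli.dt$word))   # expect TRUE

library(data.table)
setkey(poli.dt, word)
poli.dt["żżonymi", mult = "first"]   # binary search should now run without the warning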
