Reading a UTF-8 text file (in Hebrew) shows gibberish in RStudio's console but is fine in Rgui


Problem description




I am trying to understand if this is a bug in RStudio or am I missing something.

I am reading a csv file into R. When I print it to the console in RStudio I get gibberish (unless I look at a specific vector), while in Rgui it prints fine.

The code I will run is this:

Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")  
x # shows gibberish
x[,2]
colnames(x)

Here is the output from RStudio (gibberish):

> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x
   âéì..áùðéí. îéâãø
1         23.0   æëø
2         24.0  ð÷áä
3         23.0  ð÷áä
4         24.0  ð÷áä
5         25.0   æëø
6         18.0   æëø
7         26.0   æëø
8         21.5  ð÷áä
9         24.0   æëø
10        26.0   æëø
11        24.0   æëø
12        19.0  ð÷áä
13        19.0  ð÷áä
14        24.5   æëø
15        21.0  ð÷áä
> x[,2]
 [1] זכר  נקבה נקבה נקבה זכר  זכר  זכר  נקבה זכר  זכר  זכר  נקבה נקבה זכר  נקבה
Levels: זכר נקבה
> colnames(x)
[1] "âéì..áùðéí." "îéâãø"      
> 

And here it is in Rgui (here it is fine):

>     x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")  
>     x # shows gibberish
   גיל..בשנים. מיגדר
1         23.0   זכר
2         24.0  נקבה
3         23.0  נקבה
4         24.0  נקבה
5         25.0   זכר
6         18.0   זכר
7         26.0   זכר
8         21.5  נקבה
9         24.0   זכר
10        26.0   זכר
11        24.0   זכר
12        19.0  נקבה
13        19.0  נקבה
14        24.5   זכר
15        21.0  נקבה
>     x[,2]
 [1] זכר  נקבה נקבה נקבה זכר  זכר  זכר  נקבה זכר  זכר  זכר  נקבה נקבה זכר  נקבה
Levels: זכר נקבה
>     colnames(x)
[1] "גיל..בשנים." "מיגדר"      
> 

In both sessions, my sessionInfo() is:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255   
[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C                  
[5] LC_TIME=Hebrew_Israel.1255    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] installr_0.17.0

I'm using the latest RStudio version, 0.99.892.

Thanks.

Solution

This is a bug in RStudio, and not the only one. I see you have already received a general answer about the problems RStudio currently has with non-English locale support on Windows. As far as I know, this is not the first time (or version) it has had similar problems. You may also run into some new problems that I think are related to Windows 10. Note that since I am hitting that second type of problem as well, I am using the English locale to print Hebrew.

So I tried some debugging on your problem and came up with a work-around, plus some new insights (I think) into where the problem lies. I think it could be debugged further to write a complete function that fixes it, but due to time (and hour) restrictions I've decided to stop here.

I've created this data:

x <- data.frame("x"= c("דור","dor"))

As already mentioned, with the Hebrew locale I get gibberish as well:

Sys.setlocale("LC_ALL", "Hebrew")
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"

"דור"
[1] "ãåø"

x
   x
1 ãåø
2 dor
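
The garbled string is consistent with a plain code-page mix-up: the bytes that encode "דור" in Windows-1255 are the same bytes that spell "ãåø" when they are read as Latin-1. Here is a minimal sketch to check this with base R only. The variable names are my own, and it assumes the local iconv build knows the name "CP1255":

```r
# "דור" written with Unicode escapes so the script itself stays ASCII
hebrew <- "\u05d3\u05d5\u05e8"

# Encode the string as Windows-1255 (the Hebrew code page) and keep the raw bytes
cp1255 <- iconv(hebrew, from = "UTF-8", to = "CP1255", toRaw = TRUE)[[1]]
print(cp1255)  # expected bytes: e3 e5 f8

# Reinterpret the very same bytes as Latin-1, which is what the console seems to do
misread <- rawToChar(cp1255)
Encoding(misread) <- "latin1"
misread_utf8 <- enc2utf8(misread)
print(misread_utf8)
```

If the last line shows "ãåø", the data itself is intact and only the display layer is choosing the wrong code page.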

With the English locale, I get this output:

Sys.setlocale("LC_ALL", "English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

 "דור"
[1] "דור"

x
                         x
1 <U+05D3><U+05D5><U+05E8>
2                      dor

Note that output that is not a data.frame prints fine. The problem also occurs with the data.table class, while list and matrix objects print fine.

Checking both print.data.frame and print.table methods reveals the main suspect: format.

Further investigation confirms these suspicions:

as.matrix(x)
     x    
[1,] "דור"
[2,] "dor"

format(as.matrix(x))
     x                         
[1,] "<U+05D3><U+05D5><U+05E8>"
[2,] "dor                     "
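
The same check can be scripted. print.data.frame formats the data frame and converts it to a matrix before printing, so comparing a matrix before and after format() shows which cells get rewritten. This is only a rough sketch with names of my own choosing; on a system that is not affected, every cell should simply come back unchanged:

```r
# Same toy data as above, with the Hebrew written as Unicode escapes
x <- data.frame(x = c("\u05d3\u05d5\u05e8", "dor"), stringsAsFactors = FALSE)
m <- as.matrix(x)

# format() pads the cells to a common width; strip the padding before comparing
formatted <- format(m)
survives_format <- trimws(as.vector(formatted)) == as.vector(m)
print(survives_format)  # FALSE marks a cell that format() replaced, e.g. with <U+....> escapes

# Work-around from this answer: print the matrix directly and skip format()
print(m, quote = FALSE)
```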

So in your case I suggest the following workflow:

Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")  
as.matrix(x) 
      âéì..áùðéí. îéâãø 
 [1,] "23.0"      "זכר" 
 [2,] "24.0"      "נקבה"
 [3,] "23.0"      "נקבה"
 [4,] "24.0"      "נקבה"
 [5,] "25.0"      "זכר" 
 [6,] "18.0"      "זכר" 
 [7,] "26.0"      "זכר" 
 [8,] "21.5"      "נקבה"
 [9,] "24.0"      "זכר" 
[10,] "26.0"      "זכר" 
[11,] "24.0"      "זכר" 
[12,] "19.0"      "נקבה"
[13,] "19.0"      "נקבה"
[14,] "24.5"      "זכר" 
[15,] "21.0"      "נקבה"

Both locales, Hebrew and English, worked on my machine, but the column names did not come out right in either.
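
For the column names, one thing that may be worth trying is to read the file with check.names = FALSE and then look at how R has marked the names. This is an untested guess on my side, not a confirmed fix: I am assuming that the name clean-up step is where things go wrong. The sketch below uses a small temporary file as a hypothetical stand-in for the real CSV so that it runs anywhere:

```r
# Hypothetical stand-in for the remote file: a tiny UTF-8 CSV in a temp file
header <- "\u05d2\u05d9\u05dc,\u05de\u05d9\u05d2\u05d3\u05e8"
rows   <- c("23.0,\u05d6\u05db\u05e8", "24.0,\u05e0\u05e7\u05d1\u05d4")
path   <- tempfile(fileext = ".csv")
con    <- file(path, open = "wb")
writeLines(enc2utf8(c(header, rows)), con, useBytes = TRUE)  # write the UTF-8 bytes untouched
close(con)

# check.names = FALSE skips make.names(), so the header is kept exactly as read
x <- read.csv(path, encoding = "UTF-8", check.names = FALSE,
              stringsAsFactors = FALSE)

# Inspect how the names are marked; "UTF-8" means R knows what the bytes are
print(Encoding(names(x)))
print(names(x))
```

If the names are marked as UTF-8 here but still show up garbled, the problem is again in how the console displays them rather than in the data.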

To conclude, this is far from a complete solution; it is just a small, partial work-around for the printing (or rather, the formatting) problem. It also sheds some more light on this Hebrew / non-English issue in RStudio, on which better solutions may be built. One example of a solution to a similar problem of writing Hebrew on Windows can be seen in this SO thread.
