Reading a UTF-8 text file (in Hebrew) shows gibberish in RStudio's console but displays fine in RGUI
Problem description
I am trying to understand whether this is a bug in RStudio or whether I am missing something.
I am reading a CSV file into R. When I print it in RStudio's console I get gibberish (unless I look at a specific vector), while in Rgui it prints fine.
The code I run is:
Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
x # shows gibberish
x[,2]
colnames(x)
Here is the output from RStudio (gibberish):
> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x
   âéì..áùðéí. îéâãø
1         23.0   æëø
2         24.0  ð÷áä
3         23.0  ð÷áä
4         24.0  ð÷áä
5         25.0   æëø
6         18.0   æëø
7         26.0   æëø
8         21.5  ð÷áä
9         24.0   æëø
10        26.0   æëø
11        24.0   æëø
12        19.0  ð÷áä
13        19.0  ð÷áä
14        24.5   æëø
15        21.0  ð÷áä
> x[,2]
 [1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
Levels: זכר נקבה
> colnames(x)
[1] "âéì..áùðéí." "îéâãø"
>
And here it is in Rgui (here it is fine):
> x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
> x # shows gibberish
   גיל..בשנים. מיגדר
1         23.0   זכר
2         24.0  נקבה
3         23.0  נקבה
4         24.0  נקבה
5         25.0   זכר
6         18.0   זכר
7         26.0   זכר
8         21.5  נקבה
9         24.0   זכר
10        26.0   זכר
11        24.0   זכר
12        19.0  נקבה
13        19.0  נקבה
14        24.5   זכר
15        21.0  נקבה
> x[,2]
 [1] זכר נקבה נקבה נקבה זכר זכר זכר נקבה זכר זכר זכר נקבה נקבה זכר נקבה
Levels: זכר נקבה
> colnames(x)
[1] "גיל..בשנים." "מיגדר"
>
In both sessions, my sessionInfo() is:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255
[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C
[5] LC_TIME=Hebrew_Israel.1255

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] installr_0.17.0
I'm using the latest RStudio version 0.99.892
Thanks.
Solution
This is a bug in RStudio, and not the only one. I see you have already received a general answer about the problems RStudio currently has with non-English locale support on Windows. As far as I know, this is not the first version with similar problems. You may also run into some newer problems that I think are related to Windows 10. Note that since I have that second type of problem as well, I use the English locale to print Hebrew.
I did some debugging on your problem and came up with a workaround and (I think) some new insight into where the problem lies. It could probably be debugged further to write a complete function that fixes it, but because of time constraints I decided to stop here.
I've created this data:
x <- data.frame("x"= c("דור","dor"))
As already mentioned, with the Hebrew locale I get gibberish as well:
Sys.setlocale("LC_ALL", "Hebrew")
[1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
"דור"
[1] "ãåø"
x
    x
1 ãåø
2 dor
With the English locale, I get this output:
Sys.setlocale("LC_ALL", "English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
"דור"
[1] "דור"
x
                         x
1 <U+05D3><U+05D5><U+05E8>
2                      dor
Note that non-data.frame output prints fine. The problem also occurs with the data.table class, while list and matrix objects print fine. Checking both the print.data.frame and print.table methods reveals the main suspect: format. Further investigation confirms this suspicion:
as.matrix(x)
     x
[1,] "דור"
[2,] "dor"
format(as.matrix(x))
     x
[1,] "<U+05D3><U+05D5><U+05E8>"
[2,] "dor                     "
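The comparison between as.matrix() and format() can be repeated as a small script. This is only a minimal sketch: it assumes nothing beyond base R, spells the Hebrew string with \u escapes so that the source file stays ASCII, and uses the variable names m, f and format_ok purely for illustration. The result depends on the session locale, so no expected output is shown.

```r
# Same toy data as in the answer, written with \u escapes so the script
# itself stays ASCII-only regardless of how the editor saves it.
x <- data.frame(x = c("\u05D3\u05D5\u05E8", "dor"), stringsAsFactors = FALSE)

m <- as.matrix(x)   # character matrix, no extra formatting
f <- format(m)      # the step suspected of breaking the display

# Declared encoding of each element: "UTF-8", "latin1", "bytes" or "unknown"
print(Encoding(m[, "x"]))
print(Encoding(f[, "x"]))

# TRUE if format() left the Hebrew string intact, FALSE if the characters
# were replaced by <U+XXXX> escapes
format_ok <- trimws(f[1, "x"]) == m[1, "x"]
print(format_ok)
```

If format_ok is FALSE while the strings in m are marked as UTF-8, that points at the formatting step rather than at how the data was read.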
So in your case I suggest the following workflow:
Sys.setlocale("LC_ALL", "Hebrew")
x <- read.csv("https://raw.githubusercontent.com/talgalili/temp2/gh-pages/Hebrew_UTF8.txt", encoding="UTF-8")
as.matrix(x)
      âéì..áùðéí. îéâãø
 [1,] "23.0"      "זכר"
 [2,] "24.0"      "נקבה"
 [3,] "23.0"      "נקבה"
 [4,] "24.0"      "נקבה"
 [5,] "25.0"      "זכר"
 [6,] "18.0"      "זכר"
 [7,] "26.0"      "זכר"
 [8,] "21.5"      "נקבה"
 [9,] "24.0"      "זכר"
[10,] "26.0"      "זכר"
[11,] "24.0"      "זכר"
[12,] "19.0"      "נקבה"
[13,] "19.0"      "נקבה"
[14,] "24.5"      "זכר"
[15,] "21.0"      "נקבה"
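To avoid typing as.matrix() each time, the workaround can be wrapped in a small function. The helper name print_as_matrix is hypothetical (it is not part of base R or RStudio), and the sketch assumes, as observed above, that the matrix print method displays the strings correctly where print.data.frame() does not. The sample data frame is made up for illustration.

```r
# Hypothetical helper: print a data frame through as.matrix() so that
# print.data.frame() is not used for display.
print_as_matrix <- function(df, quote = FALSE) {
  stopifnot(is.data.frame(df))
  m <- as.matrix(df)        # a mixed data frame becomes a character matrix
  print(m, quote = quote)   # dispatches to the matrix print method
  invisible(m)              # return the matrix for further use
}

# Illustrative data with \u escapes for the Hebrew values
df <- data.frame(age = c(23, 24.5),
                 gender = c("\u05D6\u05DB\u05E8", "\u05E0\u05E7\u05D1\u05D4"),
                 stringsAsFactors = FALSE)
m <- print_as_matrix(df)
```

Note that the conversion turns numeric columns into strings, so this is suitable for inspecting data in the console, not for further computation.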
Both locales, Hebrew and English, worked on my machine, but col.names didn't work in either of them.
To conclude, this is far from a complete solution; it is only a small, partial workaround for the printing (or rather, formatting) problem. It also sheds some more light on this Hebrew / non-English issue in RStudio, on the basis of which better solutions may be written. One example of a solution to a similar problem of writing Hebrew on Windows can be seen in this SO thread.
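For the column names that still display incorrectly, a possible next debugging step is to check how R has marked their encoding. The sketch below is an assumption-laden illustration, not something tested in this answer: it writes a tiny UTF-8 CSV to a temporary file (the header and values are stand-ins for the original file), and it passes check.names = FALSE only to keep read.csv() from rewriting the header, which may or may not matter in your setup.

```r
# Write a tiny UTF-8 CSV to a temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
lines <- c("\u05D2\u05D9\u05DC,\u05DE\u05D9\u05D2\u05D3\u05E8",
           "23,\u05D6\u05DB\u05E8",
           "24,\u05E0\u05E7\u05D1\u05D4")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(lines), con, useBytes = TRUE)  # write the UTF-8 bytes as-is
close(con)

x <- read.csv(tmp, encoding = "UTF-8", check.names = FALSE,
              stringsAsFactors = FALSE)

# How R has marked the header strings and the character column
print(Encoding(colnames(x)))
print(Encoding(x[[2]]))

# Code points of the first column name; Hebrew letters lie in U+05D0..U+05EA
print(utf8ToInt(enc2utf8(colnames(x)[1])))

unlink(tmp)
```

If the data column is marked "UTF-8" while the column names are "unknown", or the code points fall outside the Hebrew range, the header was interpreted in the native encoding, which would be consistent with the names staying garbled.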