Hebrew Encoding Hell in R and writing a UTF-8 table in Windows


Question


I'm trying to save data extracted with RSelenium from https://www.magna.isa.gov.il/Details.aspx?l=he. Although R prints Hebrew characters to the console correctly, it does not handle them correctly when exporting to TXT or CSV, or in other simple R functions such as data.frame(), readHTMLTable(), etc.

Here is an example.

> head(lines)
[1] "גלובל פיננס ג'י.אר. 2 בע\"מ נתונים כספיים באלפי דולר ארה\"ב"
[2] "513435404"                                                  
[3] ""                                                           
[4] ""                                                           
[5] ""                                                           
[6] "4,481" 

The first line changes to weird characters (below) when using data.frame():

> head(as.data.frame(lines))
[1] <U+05D2><U+05DC><U+05D5><U+05D1><U+05DC> <U+05E4><U+05D9><U+05E0><U+05E0><U+05E1> <U+05D2>'<U+05D9>.<U+05D0><U+05E8>. 2 <U+05D1><U+05E2>"<U+05DE> <U+05E0><U+05EA><U+05D5><U+05E0><U+05D9><U+05DD> <U+05DB><U+05E1><U+05E4><U+05D9><U+05D9><U+05DD> <U+05D1><U+05D0><U+05DC><U+05E4><U+05D9> <U+05D3><U+05D5><U+05DC><U+05E8> <U+05D0><U+05E8><U+05D4>"<U+05D1>

The same happens when exporting .TXT or .CSV by write.table or write.csv:

write.csv(lines,"lines.csv",row.names=FALSE)

I tried to change the encoding to "UTF-8", as suggested in several similar questions, yet the issue remains, just in a different form:

iconv(lines, to = "UTF-8")
1 ׳’׳׳•׳‘׳ ׳₪׳™׳ ׳ ׳¡ ׳’'׳™.׳׳¨. 2 ׳‘׳¢"׳ ׳ ׳×׳•׳ ׳™׳ ׳›׳¡׳₪׳™׳™׳ ׳‘׳׳׳₪׳™ ׳"׳•׳׳¨ ׳׳¨׳""׳‘

Same for Hebrew ISO-8859-8:

iconv(lines, to = "ISO-8859-8")
    1 ×'×o×.×'×o ×₪×T× × ×! ×''×T.×ר. 2 ×'×¢"×z × ×a×.× ×T× ×>×!×₪×T×T× ×'××o×₪×T ×"×.×oר ×ר×""×'

I don't understand why the console prints Hebrew characters well while write.table(), write.csv() and data.frame() present encoding issues.

Can anyone help me export it?

That part was answered by Ken: exporting text with writeLines() worked well:

# write the strings as raw bytes; the file name is passed directly,
# so no separate connection with an encoding needs to be opened
writeLines(lines, "lines.txt", useBytes = TRUE)

Yet the main issue I have with Hebrew encoding in R arises when dealing with tables, that is, with as.data.frame(), write.table() and write.csv(). Any thoughts?

Some machine info:

Sys.info()
                 sysname                      release                      version 
               "Windows"                      "7 x64" "build 7601, Service Pack 1" 
                nodename                      machine                        login 
              "TALIS-TP"                        "x86"

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Solution

Many people have similar problems working with UTF-8 text on platforms that have 8-bit system encodings (Windows). Encoding in R can be tricky, because different methods handle encoding and conversions differently, and what appears to work fine on one platform (OS X or Linux) works poorly on another.

The problem has to do with your output connection and how Windows handles encodings and text connections. I've tried to replicate the problem using some Hebrew texts in both UTF-8 and an 8-bit encoding. We'll walk through the file reading issues as well, since there could be some snags there too.

For Tests

  • Created a short Hebrew language text file, encoded as UTF-8: hebrew-utf8.txt

  • Created a short Hebrew language text file, encoded as ISO-8859-8: hebrew-iso-8859-8.txt. (Note: You might need to tell your browser about the encoding in order to view this one properly - that's the case for Safari for instance.)

Ways to read the files

Now let's experiment. I am using Windows 7 for these tests (it actually works in OS X, my usual OS).

lines <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt")
lines
## [1] "×"עברי ×"×•× ×—×‘×¨ בקבוצ×" ×"×›× ×¢× ×™×ª של שפות שמיות."                                                                     
## [2] "זו ×"ית×" ×©×¤×ª× ×©×œ ×"×™×"ו×"×™× ×ž×•×§×"×, ×בל מן 586 ×œ×¤× ×"\"ס ×–×" ×"תחיל ל×"יות מוחלף על ×™×"×™ ב×רמית."

That failed because it assumed the encoding was your system encoding, Windows-1252. But because no conversion occurred when you read the files, you can fix this just by setting the Encoding bit to UTF-8:

# this sets the bit for UTF-8
Encoding(lines) <- "UTF-8"
lines
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."                                          
## [2] "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה\"ס זה התחיל להיות מוחלף על ידי בארמית."
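A minimal sketch, not part of the original answer, of why setting the Encoding bit is enough in this case: Encoding() only changes the declared mark and leaves the stored bytes untouched. The variable names are illustrative, and the string is built from raw bytes so that the example does not depend on the locale of the R session.

```r
# UTF-8 bytes of a short Hebrew word (U+05E9 U+05DC U+05D5 U+05DD)
utf8_bytes <- as.raw(c(0xD7, 0xA9, 0xD7, 0x9C, 0xD7, 0x95, 0xD7, 0x9D))

# rawToChar() builds a string from the bytes without declaring an encoding
x <- rawToChar(utf8_bytes)
mark_before <- Encoding(x)

# declare the encoding; no bytes are converted
Encoding(x) <- "UTF-8"
mark_after <- Encoding(x)

# the stored bytes are the same before and after marking
bytes_unchanged <- identical(charToRaw(x), utf8_bytes)
print(c(mark_before, mark_after))
print(bytes_unchanged)
```

Because the bytes were already valid UTF-8, changing the mark is all that was needed to make R display them as Hebrew.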

But better to do this when you read the file:

# this does it in one pass
lines2 <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt", encoding = "UTF-8")
lines2[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Encoding(lines2)
## [1] "UTF-8" "UTF-8"

Now look at what happens if we try to read the same text, but encoded as the 8-bit ISO Hebrew code page.

lines3 <- readLines("http://kenbenoit.net/files/hebrew-iso-8859-8.txt")
lines3[1]
## [1] "äòáøé äåà çáø á÷áåöä äëðòðéú ùì ùôåú ùîéåú." 

Setting the Encoding bit is of no help here, because what was read does not map to the Unicode code points for Hebrew. Encoding() does no actual encoding conversion; it merely sets an extra bit that tells R which of a few possible encodings to assume. Adding encoding = "ISO-8859-8" to the readLines() call would not help either, since that argument only marks strings as Latin-1, UTF-8 or bytes and does not re-encode the input. What does work is converting the text after loading, using iconv():

# this will not fix things
Encoding(lines3) <- "UTF-8"
lines3[1]
## [1] "\xe4\xf2\xe1\xf8\xe9 \xe4\xe5\xe0 \xe7\xe1\xf8 \xe1\xf7\xe1\xe5\xf6\xe4 \xe4\xeb\xf0\xf2\xf0\xe9\xfa \xf9\xec \xf9\xf4\xe5\xfa \xf9\xee\xe9\xe5\xfa."
# but this will
iconv(lines3, "ISO-8859-8", "UTF-8")[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."

Overall I think the method used above for lines2 is the best approach.
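The iconv() conversion above can be checked byte by byte. This is a small sketch under the assumption that the iconv implementation on your platform knows the name "ISO-8859-8". The test word and variable names are illustrative and not from the files used above.

```r
# the same short Hebrew word, this time as ISO-8859-8 bytes (one byte per letter)
iso_bytes <- as.raw(c(0xF9, 0xEC, 0xE5, 0xED))
iso_text <- rawToChar(iso_bytes)

# iconv() re-encodes: the bytes themselves change, unlike with Encoding()<-
utf8_text <- iconv(iso_text, from = "ISO-8859-8", to = "UTF-8")

# each Hebrew letter now takes two bytes in UTF-8
converted_bytes <- charToRaw(utf8_text)
print(converted_bytes)
print(Encoding(utf8_text))
```

The result is both converted and marked as UTF-8, so it can be treated exactly like lines2 from here on.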

How to output the files, preserving encoding

Now to your question about how to write this: The safest way is to control your connection at a low level, where you can specify the encoding. Otherwise, the default is for R/Windows to choose your system encoding, which will lose the UTF-8. I thought the following would work, and it does work absolutely fine in OS X. On OS X it also works fine to call writeLines() with just a file name, without opening a connection yourself.

## to write lines, use the encoding option of a connection object
f <- file("hebrew-output-UTF-8.txt", open = "wt", encoding = "UTF-8")
writeLines(lines2, f)
close(f)

But it does not work on Windows. You can see the Windows 7 results here: hebrew-output-UTF-8-file_encoding.txt.

So, here is how to do it in Windows: Once you are sure your text is encoded as UTF-8, just write it as raw bytes, without using any encoding, like this:

writeLines(lines2, "hebrew-output-UTF-8-useBytesTRUE.txt", useBytes = TRUE)

You can see the results at hebrew-output-UTF-8-useBytesTRUE.txt, which is now UTF-8 and looks correct.
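If you want to confirm on your own machine that the bytes on disk are the UTF-8 bytes held in memory, a check along these lines should do. It is a sketch with illustrative names, using a temporary file rather than the files linked above.

```r
# a UTF-8 string built from raw bytes and marked as UTF-8
x <- rawToChar(as.raw(c(0xD7, 0xA9, 0xD7, 0x9C, 0xD7, 0x95, 0xD7, 0x9D)))
Encoding(x) <- "UTF-8"

# write the string as raw bytes, bypassing any re-encoding
out <- tempfile(fileext = ".txt")
writeLines(x, out, useBytes = TRUE)

# read the file back as bytes and drop the line terminator (LF, or CRLF on Windows)
written <- readBin(out, what = "raw", n = file.info(out)$size)
payload <- written[!(written %in% as.raw(c(0x0D, 0x0A)))]

# TRUE means the file holds exactly the UTF-8 bytes of x
same_bytes <- identical(payload, charToRaw(x))
print(same_bytes)
```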

Added for write.csv

Note that the only reason you would want to do this is to make the .csv file available for import into other software, such as Excel. (And good luck working with UTF-8 in Excel/Windows...) Otherwise, you should just write the data.frame as binary using save(myDataFrame, file = "myDataFrame.RData"). But if you really need to output .csv, then:

How to write UTF-8 .csv files from a data.frame in Windows

The problem with writing UTF-8 files using write.table() and write.csv() is that these open text connections, and Windows has limitations about encodings and text connections with respect to UTF-8. (This post offers a helpful explanation.) Following from an SO answer posted here, we can override this to write our own function to output UTF-8 .csv files.

This assumes that you have already set the Encoding() for any character elements to "UTF-8" (which happens upon import above for lines2).

df <- data.frame(int = 1:2, text = lines2, stringsAsFactors = FALSE)

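write_utf8_csv <- function(df, file) {
    # header row: quote each column name and join the names with " , "
    firstline <- paste('"', names(df), '"', sep = "", collapse = " , ")
    # data rows: apply() coerces each row to character; quote and join every field
    data <- apply(df, 1, function(x) {paste('"', x, '"', sep = "", collapse = " , ")})
    # useBytes = TRUE writes the UTF-8 bytes as-is, without re-encoding to the system code page
    writeLines(c(firstline, data), file, useBytes = TRUE)
}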

write_utf8_csv(df, "df_csv.txt")

When we now look at that file in a non-Unicode-challenged OS, it looks fine:

KBsMBP15-2:Desktop kbenoit$ cat df_csv.txt 
"int" , "text"
"1" , "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
"2" , "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה"ס זה התחיל להיות מוחלף על ידי בארמית."
KBsMBP15-2:Desktop kbenoit$ file df_csv.txt 
df_csv.txt: UTF-8 Unicode text, with CRLF line terminators
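One caveat: the second row above contains a double quote inside a quoted field, which stricter CSV readers may reject. Below is a sketch of a variant that doubles embedded quotes and uses a plain comma as the separator. The function name write_utf8_csv_escaped and the demo data are hypothetical and not part of the original answer.

```r
write_utf8_csv_escaped <- function(df, file) {
  # convert to UTF-8, double any embedded quotes, then wrap the field in quotes
  quote_field <- function(x) {
    x <- enc2utf8(as.character(x))
    paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
  }
  header <- paste(quote_field(names(df)), collapse = ",")
  # quote column by column, then paste the columns together row-wise
  cols <- unname(lapply(df, quote_field))
  rows <- do.call(paste, c(cols, sep = ","))
  # as before, write raw bytes so Windows does not re-encode the UTF-8
  writeLines(c(header, rows), file, useBytes = TRUE)
}

# demo: a Hebrew word (written with \u escapes) and a field with quotes and a comma
df_demo <- data.frame(int = 1:2,
                      text = c("\u05e9\u05dc\u05d5\u05dd", "say \"hi\", ok"),
                      stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv")
write_utf8_csv_escaped(df_demo, out)

# read it back, telling R the strings are UTF-8
back <- read.csv(out, encoding = "UTF-8", stringsAsFactors = FALSE)
print(identical(back$text, df_demo$text))
```

Working column by column instead of with apply() also avoids the row-wise coercion of the whole data frame to a character matrix.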
