How to read an Excel file containing Chinese characters in R?


Question


I always convert Excel files to CSV and import them into R as follows.

myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors = FALSE)

But I ran into a serious problem when converting an xlsx file written in Chinese: most of the characters (not all of them) show up as '??' because of the encoding.

So I decided to import the file directly with the xlsx package. The problem is that the Excel file is larger than 10 MB, and the import fails with an error caused by the JVM's memory limit. (I assume that xlsx uses Java internally.)

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I import a Chinese Excel file into R? I tried 'Save as...' CSV, then opened the file in Notepad and saved it with the 'UTF-8' option, but the result was the same (it still shows '??').

FYI, I can see all the Chinese characters in the original Excel file.

Answer

Your question mixes several issues. Let's assume that you have converted the xlsx file to csv. If you haven't, please refer to other threads like this one. I think this step is best carried out in some external tool rather than in R.

Now that we've got a csv, two problems remain: size and encoding. For encoding, as you have mentioned in the comment, you can use the encoding-related options of several R functions like read.csv. Note that in read.csv the encoding= argument only marks strings as Latin-1 or UTF-8; the argument that actually re-encodes the input is fileEncoding=. For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot tell, the open file dialog of LibreOffice Calc may give you some clue.
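As a minimal sketch of the encoding step, the snippet below writes a tiny GB18030-encoded csv to a temporary file and reads it back with fileEncoding = "GB18030". The sample data and the names csv_path and df are made up for illustration, and the sketch assumes R is running in a UTF-8 locale and that the platform's iconv knows GB18030; your real file may well use a different encoding.

```r
# Build a small GB18030-encoded csv so the example is self-contained.
# The \u escapes spell two Chinese words (a name and a city).
utf8_text <- "name,city\n\u5f20\u4e09,\u5317\u4eac\n"
gb_bytes <- iconv(utf8_text, from = "UTF-8", to = "GB18030", toRaw = TRUE)[[1]]
csv_path <- tempfile(fileext = ".csv")
writeBin(gb_bytes, csv_path)

# fileEncoding declares the encoding of the file on disk, so the
# characters are converted to the session encoding while reading.
df <- read.csv(csv_path, fileEncoding = "GB18030", stringsAsFactors = FALSE)
print(df)
```

If the characters still come out as '??', the guess "GB18030" is probably wrong for that file, or the session locale cannot represent Chinese characters.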

If the file is large, you may first convert the encoding with the Linux command iconv, and then process it further in R.
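On the command line this would be something like iconv -f GB18030 -t UTF-8 in.csv > out.csv. If you would rather stay inside R, a rough equivalent can be sketched with R's own iconv() function, converting the file in chunks so that it never has to fit in memory at once. The function name convert_to_utf8 and its parameters are hypothetical, and "GB18030" as the source encoding is only an assumption.

```r
# Hypothetical helper: re-encode a text file to UTF-8 chunk by chunk.
convert_to_utf8 <- function(src, dest, from = "GB18030", chunk_size = 10000L) {
  input <- file(src, open = "rb")    # binary mode: bytes are read unchanged
  output <- file(dest, open = "wb")
  on.exit({
    close(input)
    close(output)
  })
  repeat {
    lines <- readLines(input, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break   # end of file reached
    converted <- iconv(lines, from = from, to = "UTF-8")
    # useBytes = TRUE writes the UTF-8 bytes as-is, whatever the locale.
    writeLines(converted, output, useBytes = TRUE)
  }
  invisible(dest)
}
```

Lines that cannot be converted come back from iconv() as NA, so it is worth checking the output before relying on it.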

Now for the size part. A 50 MB or even 500 MB csv can easily be handled by read.csv, although not necessarily quickly, provided that you have enough memory. If the file is larger than 1 GB, there are two options:

  1. Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
  2. Process the csv line by line. First use file() to create a connection, then use readLines() to process it line by line. Finally, combine the results manually into a data.frame or other appropriate structure.

The first one is simpler; the second one can handle really large files.
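The second option might look roughly like the sketch below. It reads the file a fixed number of lines at a time, parses each chunk with read.csv, lets you shrink the chunk (filter or aggregate) before keeping it, and binds the pieces together at the end. The function read_csv_in_chunks and its keep argument are hypothetical names, and the sketch assumes the file has a header row, is already in an encoding your session can read, and has no quoted fields that span several lines.

```r
# Hypothetical helper: process a csv chunk by chunk through a connection.
read_csv_in_chunks <- function(path, chunk_size = 1000L,
                               keep = function(df) df) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L, warn = FALSE)  # column names
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break                # end of file reached
    # Re-attach the header so every chunk parses with the same columns.
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    # keep() reduces the chunk so only the needed rows stay in memory.
    pieces[[length(pieces) + 1L]] <- keep(chunk)
  }
  do.call(rbind, pieces)
}
```

For example, read_csv_in_chunks("big.csv", keep = function(df) df[df$x > 2, , drop = FALSE]) would retain only the rows where a column named x exceeds 2; both the file name and the column are placeholders.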

Hope it helps.
