Extremely slow R code and hanging


Problem Description


Calling the read.table() function (on a CSV file) as follows:

  download.file(url, destfile = file, mode = "w")
  conn <- gzcon(bzfile(file, open = "r"))
  try(fileData <- read.table(conn, sep = ",", row.names = NULL), silent = FALSE)

produces the following error:

Error in pushBack(c(lines, lines), file) : 
  can only push back on text-mode connections

I tried to "wrap" the connection explicitly by tConn <- textConnection(readLines(conn)) [and then, certainly, passing tConn instead of conn to read.table()], but it triggered extreme slowness in code execution and eventual hanging or R processes (had to restart R).

UPDATE (It shows once again how useful it is to try to explain your problem to other people!):

As I was writing this, I decided to go back to the documentation and read again about gzcon(), which I thought not only decompresses the bzip2 file, but also "labels" it as text. But then I realized that this is a ridiculous assumption: I know there is a text (CSV) file inside the bzip2 archive, but R doesn't. Therefore, my initial attempt to use textConnection() was the right approach, but something is creating a problem. If - and it's a big IF - my logic is correct up to this point, the next question is whether the problem is due to textConnection() or readLines().
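For what it's worth, bzfile() already handles the bzip2 decompression, and opening it in text mode gives read.table() a text-mode connection directly; it is the gzcon() wrapper that yields a binary connection and hence the pushBack() error. A minimal sketch of that simplification, reusing the url and file variables and the sep from the original call (whether this also removes the slowness for these particular files is an assumption):

  download.file(url, destfile = file, mode = "w")
  conn <- bzfile(file, open = "rt")   # text-mode connection; no gzcon() needed
  try(fileData <- read.table(conn, sep = ",", row.names = NULL), silent = FALSE)
  close(conn)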

Please advise. Thank you!

P.S. The CSV files that I'm trying to read are in an "almost" CSV format, so I can't use standard R functions for CSV processing.

===

UPDATE 1 (Program Output):

===

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectAuthors2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 514960 bytes (502 Kb)
opened URL
==================================================
downloaded 502 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDependencies2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 133295 bytes (130 Kb)
opened URL
==================================================
downloaded 130 Kb

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDescriptions2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 5404286 bytes (5.2 Mb)
opened URL
==================================================
downloaded 5.2 Mb

===

UPDATE 2 (Program output):

===

After a very long time, I get the following message, and then the program continues processing the rest of the files:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 8 elements

Then the situation repeats: after processing several smaller (less than 1MB) files, the program "freezes" on processing a larger (> 1MB) file:

trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectTags2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 1226391 bytes (1.2 Mb)
opened URL
==================================================
downloaded 1.2 Mb

===

UPDATE 3 (Program output):

===

After giving the program more time to run, I discovered the following:

*) My assumption that a file size of ~1MB plays a role in the weird behavior was wrong. This is based on the fact that the program successfully processed files larger than 1MB and could not process files smaller than 1MB. Here is example output with errors:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 826288 bytes (806 Kb)
opened URL
==================================================
downloaded 806 Kb

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 4 elements
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

Example with errors while processing a very small file:

trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 3092 bytes
opened URL
==================================================
downloaded 3092 bytes

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 2 elements

From the above examples, it is clear that size is not the factor, but file structure might be.
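The "line N did not have M elements" and "EOF within quoted string" errors usually point to stray quote characters or a mismatched separator rather than to file size. A small diagnostic sketch along those lines (the sep = "\t" and quote = "" settings are assumptions to be checked against the actual files):

  conn <- bzfile("temp.txt.bz2", open = "rt")
  # Count how many fields each line has, with quote handling disabled
  fields <- count.fields(conn, sep = "\t", quote = "")
  close(conn)
  table(fields)   # a ragged distribution suggests structural problems, not size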

*) I misreported the maximum file size earlier; it is 54.2MB compressed. Processing this file does not merely generate error messages and continue, it actually triggers an unrecoverable error and stops (exits):

trying URL 'http://flossdata.syr.edu/data/gc/2012/2012-Nov/gcProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 56793796 bytes (54.2 Mb)
opened URL
=================================================
downloaded 54.2 Mb

Error in textConnection(readLines(conn)) : 
  cannot allocate memory for text connection

*) After the emergency exit, five R processes were using 51% of memory each, while after a manual R restart this figure remains at 7% (per htop).
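One way to avoid the "cannot allocate memory for text connection" failure on the larger archives might be to skip textConnection() entirely and read the decompressed file in chunks with readLines(). A sketch under that assumption (the chunk size and the parsing step are placeholders):

  conn <- bzfile("temp.txt.bz2", open = "rt")
  repeat {
    chunk <- readLines(conn, n = 10000)      # read up to 10,000 lines at a time
    if (length(chunk) == 0) break            # end of file reached
    # parse the chunk here, e.g. strsplit(chunk, "\t", fixed = TRUE)
  }
  close(conn)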

Even allowing for the possibility of a "very bad" text/CSV format (suggested by the "Error in scan()" messages), the behavior of the standard R functions textConnection() and/or readLines() looks very strange to me, even "suspicious". My understanding is that a good function should handle erroneous input data gracefully, allowing a very limited amount of time or number of retries and then continuing processing if possible, or exiting when further processing is impossible. In this case we see (via the defect ticket screenshot) that the R process is taxing both the memory and the processor of the virtual machine.

Solution

You don't have CSV files. I only looked at one of them (yes, I actually had a look in a text editor), but they seem to be tab-delimited.

url <- 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
file <- "temp.txt.bz2"
download.file(url, destfile = file, mode = "w")
dat <- bzfile(file, open = "r")
DF <- read.table(dat, header=TRUE, sep="\t")
close(dat)

head(DF)
#   proj_num proj_unixname               requirement       requirement_type      date_collected datasource_id
# 1       14          A2ps                    E-mail           Help,Support 2012-11-02 10:57:40           346
# 2       99          Acct                    E-mail           Bug Tracking 2012-11-02 10:57:40           346
# 3      128          Adns    VCS Repository Webview              Developer 2012-11-02 10:57:40           346
# 4      128          Adns                    E-mail                   Help 2012-11-02 10:57:40           346
# 5      196        AmaroK    VCS Repository Webview           Bug Tracking 2012-11-02 10:57:40           346
# 6      196        AmaroK Mailing List Info/Archive Bug Tracking,Developer 2012-11-02 10:57:40           346
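If you want one malformed archive not to stop the whole batch (the concern about graceful error handling raised above), one option is to wrap each read in tryCatch(). A sketch, where the urls vector stands in for whatever list of .txt.bz2 links you are iterating over:

urls <- c('http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2')

results <- lapply(urls, function(u) {
  file <- "temp.txt.bz2"
  download.file(u, destfile = file, mode = "w")
  dat <- bzfile(file, open = "r")
  on.exit(close(dat))                       # make sure the connection is closed
  tryCatch(read.table(dat, header = TRUE, sep = "\t"),
           error = function(e) { message("Skipping ", u, ": ", conditionMessage(e)); NULL })
})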
