R + ggplot2 - Cannot allocate vector of size 128.0 Mb


Question

I have a 4.5 MB file (9,223,136 lines) with the following information:

0       0
0.0147938       3.67598e-07
0.0226194       7.35196e-07
0.0283794       1.10279e-06
0.033576        1.47039e-06
0.0383903       1.83799e-06
0.0424806       2.20559e-06
0.0465545       2.57319e-06
0.0499759       2.94079e-06

Each column holds a value from 0 to 100 representing a percentage. My goal is to draw a graph in ggplot2 to compare the percentages between the two columns (e.g., at 20% of column1, what percentage is reached on column2). Here is my R script:

library(ggplot2)
dataset <- read.table("~/R/datasets/cumul.txt.gz")
p <- ggplot(dataset, aes(V2, V1))
p <- p + geom_line()
# "formatter" was removed in ggplot2 0.9; use labels = scales::percent instead
p <- p + scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent)
p <- p + theme_bw()
ggsave("~/R/grafs/cumul.png")

The problem is that every time I run this, R runs out of memory with the error: "Cannot allocate vector of size 128.0 Mb". I'm running 32-bit R on a Linux machine and have about 4 GB of free memory.
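A back-of-the-envelope estimate shows why this fails: each double takes 8 bytes, so the two numeric columns alone need roughly 9,223,136 × 2 × 8 bytes ≈ 141 MB, and ggplot2 makes several internal copies while building the plot, which quickly exhausts a 32-bit R session's limited address space. A quick sketch of the arithmetic:

```r
# Rough memory estimate for the raw data alone (two double columns,
# 8 bytes per double), before ggplot2 makes any working copies.
n_rows <- 9223136
bytes_per_double <- 8
mb <- n_rows * 2 * bytes_per_double / 2^20
round(mb)  # ~141 MB
```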

I thought of a workaround: reduce the precision of the values (by rounding them) and eliminate duplicate lines, so that the data set has fewer rows. Could you give me some advice on how to do this?
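A minimal sketch of that workaround, using base R's round() and unique() (it assumes `dataset` has already been loaded by the read.table call above, with the default V1/V2 column names):

```r
# Round both columns to 3 decimal places, then keep only the unique rows.
# For values scaled to [0, 1] this caps the data set at ~1001^2 distinct rows.
dataset$V1 <- round(dataset$V1, 3)
dataset$V2 <- round(dataset$V2, 3)
dataset <- unique(dataset)
```

The precision (3 decimal places here) is a tunable assumption; coarser rounding shrinks the data further at the cost of plot detail.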

Answer

Are you sure you have 9 million lines in a 4.5 MB file (edit: perhaps your file is 4.5 GB?)? It must be heavily compressed -- when I create a file one tenth that size, it's 115 Mb ...

n <- 9e5
set.seed(1001)
z <- rnorm(n)
z <- cumsum(z)/sum(z)
d <- data.frame(V1 = seq(0, 1, length = n), V2 = z)
ff <- gzfile("lgfile2.gz", "w")
write.table(d, row.names = FALSE, col.names = FALSE, file = ff)
close(ff)
file.info("lgfile2.gz")["size"]

It's hard to tell from the information you've given what kind of "duplicate lines" you have in your data set ... unique(dataset) will extract just the unique rows, but that may not be useful. I would probably start by simply thinning the data set by a factor of 100 or 1000:

smdata <- dataset[seq(1,nrow(dataset),by=1000),]

and see how it goes from there. (edit: forgot a comma!)

Graphical representations of large data sets are often a challenge. In general you will be better off:

  • summarizing your data in some way before plotting it
  • using a specialized graph type that reduces the data (density plots, contours, hexagonal binning)
  • using base graphics, which uses a "draw and forget" model (unless graphics recording is turned on, e.g. in Windows), rather than lattice/ggplot/grid graphics, which save a complete graphical object and then render it
  • using raster or bitmap graphics (PNG etc.), which only record the state of each pixel in the image, rather than vector graphics, which save all the objects whether or not they overlap
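As one concrete instance of the "summarize before plotting" advice, the rows can be collapsed with base R's cut() and aggregate() before anything is handed to ggplot2. This sketch assumes the V1/V2 columns from the script above; the number of bins (100) is an arbitrary choice:

```r
# Summarize before plotting: bin V2 into 100 intervals over its observed
# range and average V1 within each bin, reducing millions of rows to at
# most 100 plotted points.
brks <- seq(min(dataset$V2), max(dataset$V2), length.out = 101)
bins <- cut(dataset$V2, breaks = brks, include.lowest = TRUE)
summ <- aggregate(list(V1 = dataset$V1), by = list(bin = bins), FUN = mean)
# feed 'summ' to ggplot (or plot()) instead of the full data set
```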

