textscan in Matlab uses excessive RAM compared to similar method in R

Problem description

I run Matlab R2011b and R version 2.13.1 on Linux Mint v12 with 16 GB of RAM.

I have a csv file. The first 5 rows (and the header) are:

#RIC,Date[G],Time[G],GMT Offset,Type,Price,Volume
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.68,1008
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.68,1008
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.66,300
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.65,1000
DAEG.OQ,07-JUL-2011,15:10:03.464,-4,Trade,1.65,3180

The file is large (approx 900MB). Given the combination of character and numeric data, one might read this file into Matlab as follows:

fid1 = fopen('/home/MyUserName/Temp/X.csv');
D = textscan(fid1, '%s%s%s%f%s%f%f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid1);

Although the file is 900MB, when running the above code, System Monitor indicates my RAM usage jumps from about 2GB to 10GB. Worse, if I attempt this same procedure with a slightly larger csv file (about 1.2 GB) my RAM maxes out at 16GB and Matlab never manages to finish reading in the data (it just stays stuck in "busy" mode).

If I wanted to read the same file into R, I might use:

D <- read.csv("/home/MyUserName/Temp/X.csv", stringsAsFactors=FALSE)

This takes a bit longer than Matlab, but System Monitor indicates my RAM usage only jumps from 2GB to 3.3GB (much more reasonable given the original file size).

My question has two parts:

1) Why is textscan such a memory hog in this scenario?

2) Is there another approach I could use to get a 1.2GB csv file of this type into Matlab on my system without maxing out the RAM?

EDIT: Just to clarify, I'm curious as to whether there exists a matlab-only solution, ie I'm not interested in a solution that involves using a different language to break up the csv file into smaller chunks (as this is what I'm already doing). Sorry Trav1s, I should have made this clear from the start.

Solution

The problem is probably that those "%s" strings are being read in to Matlab cellstrs, which are a memory-inefficient data structure for low cardinality strings. Cellstrs are lousy for big tabular data like this. Each string ends up getting stored in a separate primitive char array, each with some 400 bytes of overhead and fragmentation issues. With your 900MB file, that looks like 18 million rows; 4 strings per row, and that's about 10-20 GB of cellstrs to hold those strings. Ugh.
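
A rough way to see the overhead for yourself (the variable names below are made up, and the exact byte counts whos reports vary by Matlab version and platform) is to store the same strings once as a cellstr and once as a single 2-d char array and compare:

n = 1e5;                                % sample row count, not the full file
asCellstr = repmat({'DAEG.OQ'}, n, 1);  % one separate char array per row
asCharMat = repmat('DAEG.OQ', n, 1);    % one n-by-7 char matrix
whos asCellstr asCharMat                % the cellstr is several times larger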

What you want is to convert those strings into compact primitive datatypes as they're coming in, instead of getting all 18 million rows slurped into bulky cell strings at once. The dates and timestamps as datenums or whatever numeric representation you're using, and those low-cardinality strings either as 2-d char arrays or some equivalent of a categorical variable. (Given your data set size, you probably want those strings represented as simple numeric identifiers with a lookup table, not chars.)
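
Purely as a hypothetical sketch, assuming D is the cell array returned by the textscan call in the question and that the date/time formats match the sample rows (adjust them if parsing fails), the conversion could look something like this:

ric = D{1};                                    % cellstr, low cardinality
[ricTable, ~, ricId] = unique(ric);            % numeric id per row plus a lookup table
dateNum   = datenum(D{2}, 'dd-mmm-yyyy');      % e.g. '07-JUL-2011'
timeNum   = rem(datenum(D{3}, 'HH:MM:SS.FFF'), 1);  % time of day as a fraction of a day
gmtOffset = D{4};
[typeTable, ~, typeId] = unique(D{5});         % 'Trade' etc. as numeric ids
price  = D{6};
volume = D{7};
clear D ric                                    % free the bulky cellstrs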

Once you've decided on your compact data structure, there are a couple of approaches to loading it in. You could just break the read into chunks in pure Matlab: use textscan() calls in a loop to read in 1000 lines at a time, parse and convert the cellstrs in that chunk into their compact forms, buffer all the results, and cat them together at the end of the read. That'll keep the peak memory requirements lower.
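
A minimal sketch of that loop (same path and format string as in the question; the conversion of the string columns is elided and only the numeric columns are buffered, to keep it short):

fid = fopen('/home/MyUserName/Temp/X.csv');
fgetl(fid);                                   % discard the header line
chunkSize = 1000;                             % rows per textscan call; tune as needed
priceBuf = {}; volumeBuf = {};                % buffers for the converted chunks
while ~feof(fid)
    C = textscan(fid, '%s%s%s%f%s%f%f', chunkSize, 'Delimiter', ',');
    if isempty(C{1}), break; end
    % ... convert C{1}, C{2}, C{3} and C{5} to their compact forms here ...
    priceBuf{end+1}  = C{6};                  %#ok<AGROW>
    volumeBuf{end+1} = C{7};                  %#ok<AGROW>
end
fclose(fid);
price  = vertcat(priceBuf{:});                % cat the buffered pieces together
volume = vertcat(volumeBuf{:});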

If you're going to do a lot of work like this, and performance matters, you might want to drop down to Java and write your own parser that can convert the strings and dates as they come in, before handing them back to Matlab as more compact datatypes. It's not hard, and the Java method can be called directly from Matlab, so this may only kind of count as using a separate language.
