Import big files/arrays with mathematica


Question

I work with Mathematica 8.0.1.0 on a Windows 7 32-bit platform. I try to import data with

Import[file,"Table"]

which works fine as long as the file (the array in the file) is small enough. But for a bigger file (38 MB) / array (9429 × 2052), I get the message:

No more memory available. Mathematica kernel has shut down. Try quitting other applications and then retry.

On my Windows 7 64-bit platform with more main memory I can import bigger files, but I think I will run into the same problem there one day, once the file has grown / the array has more rows.

So, I am trying to find a solution for importing big files. After searching for some time, I found a similar question here: Way to deal with large data files in Wolfram Mathematica. But it seems my Mathematica knowledge is not good enough to adapt the suggested OpenRead, ReadList or similar to my data (see here the example file). The problem is that the rest of my program needs information about the array in the file, such as Dimensions and the Max/Min of some columns and rows, and I perform operations on some columns and on every row. But when I use e.g. ReadList, I never get the same array information as I do with Import (probably because I am doing it the wrong way).
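As a side note on the ReadList route: for a file that is purely numeric, ReadList with RecordLists -> True does return the same nested list of rows as Import[file, "Table"], so Dimensions, Max/Min and per-row operations work on it directly. A minimal sketch (the file name is a placeholder):

```mathematica
(* For whitespace-separated, purely numeric files, RecordLists -> True
   collects the numbers of each line into their own sublist, giving the
   same list-of-rows shape as Import[file, "Table"]. *)
data = ReadList["data.txt", Number, RecordLists -> True];

Dimensions[data]       (* e.g. {9429, 2052} *)
Max[data[[All, 1]]]    (* maximum of the first column *)
```

This only holds when every field parses as a number; tables mixing text and numbers need an Import-based approach like the one in the answer.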

Could somebody here give me some advice? I would appreciate any support!

Answer

For some reason, the current implementation of Import for the type Table (tabular data) is quite memory-inefficient. Below I've made an attempt to remedy this situation somewhat, while still reusing Mathematica's high-level importing capabilities (through ImportString). For sparse tables, a separate solution is presented, which can lead to very significant memory savings.

Here is a much more memory-efficient function:

Clear[readTable];
readTable[file_String?FileExistsQ, chunkSize_: 100] :=
   Module[{stream, dataChunk, result, linkedList, add},
      SetAttributes[linkedList, HoldAllComplete];
      add[ll_, value_] := linkedList[ll, value];           
      stream  = StringToStream[Import[file, "String"]];
      Internal`WithLocalSettings[
         Null,
         (* main code *)
         result = linkedList[];
         While[dataChunk =!= {},
           dataChunk = 
              ImportString[
                 StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]], 
                 "Table"];
           result = add[result, dataChunk];
         ];
         result = Flatten[result, Infinity, linkedList],
         (* clean-up *)
         Close[stream]
      ];
      Join @@ result]
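The linkedList trick above is what keeps the accumulation cheap: instead of appending each chunk to a growing list (which is quadratic in the number of chunks), chunks are nested inside an inert head and spliced out once at the end. A stripped-down illustration of the idiom:

```mathematica
SetAttributes[linkedList, HoldAllComplete];
(* add evaluates its arguments before wrapping them, which the
   HoldAllComplete attribute on linkedList would otherwise prevent *)
add[ll_, value_] := linkedList[ll, value];

ll = linkedList[];
ll = add[ll, {1, 2}];   (* linkedList[linkedList[], {1, 2}] *)
ll = add[ll, {3, 4}];

(* Level-Infinity Flatten with linkedList as the third argument
   splices out the nested heads, leaving one flat expression *)
flat = Flatten[ll, Infinity, linkedList]  (* linkedList[{1, 2}, {3, 4}] *)

Join @@ flat  (* {1, 2, 3, 4} - chunks concatenated, as in readTable *)
```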

Here I compare it with the standard Import, for your file:

In[3]:= used = MaxMemoryUsed[]
Out[3]= 18009752

In[4]:= 
tt = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"];//Timing
Out[4]= {34.367,Null}

In[5]:= used = MaxMemoryUsed[]-used
Out[5]= 228975672

In[6]:= 
t = Import["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt","Table"];//Timing
Out[6]= {25.615,Null}

In[7]:= used = MaxMemoryUsed[]-used
Out[7]= 2187743192

In[8]:= tt===t
Out[8]= True

You can see that my code is about 10 times more memory-efficient than Import, while not being much slower. You can control the memory consumption by adjusting the chunkSize parameter. Your resulting table occupies about 150-200 MB of RAM.

EDIT

I want to illustrate how one can make this function yet 2-3 times more memory-efficient during the import, plus another order of magnitude more memory-efficient in terms of the final memory occupied by your table, using SparseArray-s. The degree to which we get memory-efficiency gains depends much on how sparse your table is. In your example, the table is very sparse.

We start with a generally useful API for the construction and deconstruction of SparseArray objects:

ClearAll[spart, getIC, getJR, getSparseData, getDefaultElement, makeSparseArray];
HoldPattern[spart[SparseArray[s___], p_]] := {s}[[p]];
getIC[s_SparseArray] := spart[s, 4][[2, 1]];
getJR[s_SparseArray] := Flatten@spart[s, 4][[2, 2]];
getSparseData[s_SparseArray] := spart[s, 4][[3]];
getDefaultElement[s_SparseArray] := spart[s, 3];
makeSparseArray[dims : {_, _}, jc : {__Integer}, ir : {__Integer}, 
     data_List, defElem_: 0] :=
 SparseArray @@ {Automatic, dims, defElem, {1, {jc, List /@ ir}, data}};

Some brief comments are in order. Here is a sample sparse array:

In[15]:= 
ToHeldExpression@ToString@FullForm[sp  = SparseArray[{{0,0,1,0,2},{3,0,0,0,4},{0,5,0,6,7}}]]

Out[15]= 
Hold[SparseArray[Automatic,{3,5},0,{1,{{0,2,4,7},{{3},{5},{1},{5},{2},{4},{5}}},
{1,2,3,4,5,6,7}}]]

(I used a ToString - ToHeldExpression cycle to convert List[...] etc. in the FullForm back to {...} for ease of reading.) Here, {3,5} are obviously the dimensions. Next is 0, the default element. Next comes a nested list, which we can denote as {1, {ic, jr}, sparseData}. Here, ic gives the running total of nonzero elements as we add rows - so it is first 0, then 2 after the first row, the second row adds 2 more, and the last adds 3 more. The next list, jr, gives the positions of the non-zero elements in all rows, so they are 3 and 5 for the first row, 1 and 5 for the second, and 2, 4 and 5 for the last one. There is no confusion as to where each row starts and ends, since this can be determined from the ic list. Finally, we have sparseData, which is the list of non-zero elements as read row by row from left to right (the ordering is the same as for the jr list). This explains the internal format in which SparseArray-s store their elements, and hopefully clarifies the role of the functions above.
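Assuming the accessor definitions from the API block above, these internal parts can be checked directly on the sample array:

```mathematica
sp = SparseArray[{{0, 0, 1, 0, 2}, {3, 0, 0, 0, 4}, {0, 5, 0, 6, 7}}];

getIC[sp]              (* {0, 2, 4, 7} - running nonzero count per row *)
getJR[sp]              (* {3, 5, 1, 5, 2, 4, 5} - column positions *)
getSparseData[sp]      (* {1, 2, 3, 4, 5, 6, 7} - values, row by row *)
getDefaultElement[sp]  (* 0 *)

(* Round trip: rebuilding from the parts reproduces the array *)
makeSparseArray[{3, 5}, getIC[sp], getJR[sp], getSparseData[sp]] == sp
(* True *)
```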

Clear[readSparseTable];
readSparseTable[file_String?FileExistsQ, chunkSize_: 100] :=
   Module[{stream, dataChunk, start, ic = {}, jr = {}, sparseData = {}, 
        getDataChunkCode, dims},
     stream  = StringToStream[Import[file, "String"]];
     getDataChunkCode := 
       If[# === {}, {}, SparseArray[#]] &@
         ImportString[
             StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]], 
             "Table"];
     Internal`WithLocalSettings[
        Null,
        (* main code *)
        start = getDataChunkCode;
        ic = getIC[start];
        jr = getJR[start];
        sparseData = getSparseData[start];
        dims = Dimensions[start];
        While[True,
           dataChunk = getDataChunkCode;
           If[dataChunk === {}, Break[]];
           ic = Join[ic, Rest@getIC[dataChunk] + Last@ic];
           jr = Join[jr, getJR[dataChunk]];
           sparseData = Join[sparseData, getSparseData[dataChunk]];
           dims[[1]] += First[Dimensions[dataChunk]];
        ],
        (* clean-up *)
        Close[stream]
     ];
     makeSparseArray[dims, ic, jr, sparseData]]
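The only subtle step in readSparseTable is stitching the per-chunk pieces together: the ic row offsets of each new chunk must be shifted by the running total, while jr and the data are simply concatenated. The loop body can be checked in isolation on a small matrix split into two row blocks (again assuming the accessor API from above):

```mathematica
full   = {{0, 0, 1, 0, 2}, {3, 0, 0, 0, 4}, {0, 5, 0, 6, 7}};
chunk1 = SparseArray[full[[;; 2]]];  (* first two rows *)
chunk2 = SparseArray[full[[3 ;;]]];  (* last row *)

ic = getIC[chunk1];                           (* {0, 2, 4} *)
ic = Join[ic, Rest@getIC[chunk2] + Last@ic];  (* {0, 2, 4, 7}: offsets shifted *)
jr = Join[getJR[chunk1], getJR[chunk2]];      (* column indices, concatenated *)
data = Join[getSparseData[chunk1], getSparseData[chunk2]];
dims = Dimensions[chunk1] + {First@Dimensions[chunk2], 0};  (* {3, 5} *)

makeSparseArray[dims, ic, jr, data] == SparseArray[full]
(* True *)
```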

Benchmarks and comparisons

Here is the starting amount of used memory (fresh kernel):

In[10]:= used = MemoryInUse[]
Out[10]= 17910208

We call our function:

In[11]:= 
(tsparse= readSparseTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[11]= {39.874,Null}

So, it is about the same speed as readTable. What about the memory usage?

In[12]:= used = MaxMemoryUsed[]-used
Out[12]= 80863296

I think this is quite remarkable: we only ever used about twice as much memory as the file itself occupies on disk. But, even more remarkably, the final memory usage (after the computation finished) has been dramatically reduced:

In[13]:= MemoryInUse[]
Out[13]= 26924456

This is because we use a SparseArray:

In[15]:= {tsparse,ByteCount[tsparse]}
Out[15]= {SparseArray[<326766>,{9429,2052}],12103816}

So, our table takes only about 12 MB of RAM. We can compare it to our more general function:

In[18]:= 
(t = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[18]= {38.516,Null}

The results are the same once we convert our sparse table back to normal:

In[20]:= Normal@tsparse==t
Out[20]= True

while the normal table occupies vastly more space (it appears that ByteCount overcounts the occupied memory by about 3-4 times, but the real difference is still at least an order of magnitude):

In[21]:= ByteCount[t]
Out[21]= 619900248

