Efficiently Working with (and generating) Large Text Files

Problem description

As part of my work, I am working with very large text files and, in part, analyzing them for word and phrase frequency. I am running into difficulties with computing time, memory restrictions, and extracting the relevant information.

For this program, I am taking a large text file (say 50MB) which has already been cleaned up and converted to lower case; otherwise it is just unstructured text. I am trying to generate lists of 'bigrams', 'trigrams', 'quadgrams', and 'fivegrams' - respectively, recurring two-, three-, four-, and five-word phrases (e.g. "i am" is a bigram, "i am free" is a trigram, "i am free always" is a quadgram).
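
To make the terminology concrete, here is a tiny sketch (the sample sentence and the name sampleWords are purely illustrative) of how such n-grams can be formed from a list of words with Partition:

sampleWords = {"i", "am", "free", "always"};
Partition[sampleWords, 2, 1]   (* bigrams: {{"i","am"},{"am","free"},{"free","always"}} *)
Partition[sampleWords, 3, 1]   (* trigrams: {{"i","am","free"},{"am","free","always"}} *)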

What am I doing right now?

Here is my current code, where inputlower is an all-lowercase String (web data scraped with Mathematica).

inputlower=Import["/directory/allTextLowered.txt"];
bigrams = 
  Sort[Tally[Partition[inputlower, 2, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/bigrams.txt", bigrams];    
Clear[bigrams];
trigrams = 
  Sort[Tally[Partition[inputlower, 3, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/trigrams.txt", trigrams];
Clear[trigrams];    
quadgrams = 
  Sort[Tally[Partition[inputlower, 4, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/quadrams.txt", quadgrams];
Clear[quadgrams];
fivegrams = 
  Sort[Tally[Partition[inputlower, 5, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/fivegrams.txt", fivegrams];

In a way, it works well: I do get information generated, and on smaller scales I find that this code works fast enough that I can have something approximating a workable Manipulate[] program. But when we're dealing with big inputs...

What is the problem with working with large files?

Most importantly, my output files are too big to be usable. Is there a way to specify a breaking point in the code: for example, I do not want any 'bigrams' that only appear once? If that proves to still leave too much information, would there be a way to specify that I do not want any 'bigrams' in the file unless they appear more than, say, 10 times? I.e. if "my cheese" appears 20 times, I want to know about it, but if "i pad" only appears once, maybe losing it would make the file more manageable?
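
As a minimal sketch of such a cut-off (assuming bigrams holds the {phrase, count} pairs produced by Tally in the code above; the threshold 10 and the file name are illustrative):

bigramsFiltered = Select[bigrams, #[[2]] > 10 &];   (* keep only bigrams that occur more than 10 times *)
Export["/directory/bigramsFiltered.txt", bigramsFiltered];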

Secondly, these processes take a long time: it took well over two or three hours to generate the bigram output alone. Am I approaching this problem in an efficient manner?

Thirdly, if I did have a large bigram file (~650MB+) that contained ALL of the information, is there a way for Mathematica to access information without loading it all into memory - i.e. taking a file named bigrams.txt, learning that it contains {{"i","am"},55} without bogging the system down?
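
One standard way to inspect such a file without reading it entirely is to work with a stream; the following is only a rough sketch (the exact stored form of each line depends on how the file was exported, so the search string here is illustrative):

str = OpenRead["/directory/bigrams.txt"];
Find[str, "\"i\", \"am\""]            (* returns the first line containing the text, scanning the file lazily *)
chunk = ReadList[str, String, 1000];   (* or read only the next 1000 lines *)
Close[str];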

Edit

[As of 7 Dec 11, I've removed the example file that I've put up - thanks to all again]

Answer

Introduction

What I will propose is different from most suggestions given before, and based on a combination of indexing, hash-tables, packed arrays, Compress, .mx files and DumpSave, and a few other things. The basic idea is to preprocess the file in a smart way, and save the preprocessed definitions in an .mx file for fast loading. Rather than base most of the work on reading from disk, I propose to shift the emphasis and do most of the work in memory, but find ways to load data from disk, store it in RAM, work with the data, and save it back to disk in a time- and memory-efficient manner. In trying to achieve this goal, I will use most of the efficient Mathematica constructions I am aware of, both for in-memory work and for interaction with the file system.

Here is the code:

Clear[words];
words[text_String] :=  ToLowerCase[StringCases[text, WordCharacter ..]];

(* Rules to replace words with integer indices, and back *)
Clear[makeWordIndexRules];
makeWordIndexRules[sym_Symbol, words : {__String}] :=
   With[{distinctWords = DeleteDuplicates[words]},
    sym["Direct"] = Dispatch[Thread[distinctWords -> Range[Length[distinctWords]]]];
    sym["Inverse"] = Dispatch[Thread[ Range[Length[distinctWords]] -> distinctWords]];
    sym["keys"] = distinctWords;
];

(* Make a symbol with DownValues / OwnValues self-uncompressing *)
ClearAll[defineCompressed];
SetAttributes[defineCompressed, HoldFirst];
defineCompressed[sym_Symbol, valueType_: DownValues] :=
  With[{newVals = 
     valueType[sym] /.
       Verbatim[RuleDelayed][
         hpt : Verbatim[HoldPattern][HoldPattern[pt_]], rhs_] :>
            With[{eval = Compress@rhs}, hpt :> (pt = Uncompress@ eval)]
        },
        ClearAll[sym];
        sym := (ClearAll[sym]; valueType[sym] = newVals; sym)
];


(* Get a list of indices corresponding to a full list of words in a text *)
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
    Developer`ToPackedArray[words /. sym["Direct"]];

(* Compute the combinations and their frequencies *)
Clear[getSortedNgramsAndFreqs];
getSortedNgramsAndFreqs[input_List, n_Integer] :=
   Reverse[#[[Ordering[#[[All, 2]]]]]] &@ Tally[Partition[input, n, 1]];


(* 
 ** Produce n-grams and store them in a hash-table. We split combinations from
 ** their frequencies, and assume indices for input, to utilize packed arrays 
 *)
Clear[produceIndexedNgrams];
produceIndexedNgrams[sym_Symbol, input_List, range : {__Integer}] :=
 Do[
   With[{ngramsAndFreqs = getSortedNgramsAndFreqs[input, i]},
     sym["NGrams", i] = Developer`ToPackedArray[ngramsAndFreqs[[All, 1]]];
     sym["Frequencies", i] =  Developer`ToPackedArray[ngramsAndFreqs[[All, 2]]]
   ],
   {i, range}];


(* Higher-level function to preprocess the text and populate the hash-tables *)
ClearAll[preprocess];
SetAttributes[preprocess, HoldRest];
preprocess[text_String, inputWordList_Symbol, wordIndexRuleSym_Symbol,
    ngramsSym_Symbol, nrange_] /; MatchQ[nrange, {__Integer}] :=
  Module[{},
    Clear[inputWordList, wordIndexRuleSym, ngramsSym];
    inputWordList = words@text;
    makeWordIndexRules[wordIndexRuleSym, inputWordList];
    produceIndexedNgrams[ngramsSym,
    getWordsIndices[wordIndexRuleSym, inputWordList], nrange]
  ];

(* Higher-level function to make the definitions auto-uncompressing and save them *)
ClearAll[saveCompressed];
SetAttributes[saveCompressed, HoldRest];
saveCompressed[filename_String, inputWordList_Symbol, wordIndexRuleSym_Symbol, 
   ngramsSym_Symbol] :=
Module[{},
   defineCompressed /@ {wordIndexRuleSym, ngramsSym};
   defineCompressed[inputWordList, OwnValues];
   DumpSave[filename, {inputWordList, wordIndexRuleSym, ngramsSym}];
];

The above functionality is very memory-hungry: processing @Ian's file took nearly 5 GB of RAM at some point. However, this is worth it, and one can also test the above with smaller files if there is not enough RAM. Generally, large files could be split into several parts to work around this problem, as sketched below.
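
A rough sketch of such splitting (the chunk size and file names are illustrative; each resulting part can then be fed to preprocess separately):

allWords = words[Import["C:\\Temp\\lowered-text-50.txt", "Text"]];
chunks = Partition[allWords, 2000000, 2000000, {1, 1}, {}];  (* blocks of at most 2 million words; the last block may be shorter *)
Do[
  Export["C:\\Temp\\lowered-text-part-" <> ToString[k] <> ".txt",
    StringJoin[Riffle[chunks[[k]], " "]], "Text"],
  {k, Length[chunks]}]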

We now start. Preprocessing takes about a minute on my machine:

test = Import["C:\\Temp\\lowered-text-50.txt", "Text"];

In[64]:= preprocess[test,inputlower,wordIndexRules,ngrams,{2,3}];//Timing
Out[64]= {55.895,Null}

The symbols inputlower, wordIndexRules, and ngrams are the names I chose for the list of words in the file and for the hash-tables. Here are some additional inputs illustrating how these symbols are used and what they mean:

In[65]:= ByteCount[inputlower]
Out[65]= 459617456

In[69]:= inputlower[[1000;;1010]]
Out[69]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[67]:= toNumbers = inputlower[[1000;;1010]]/.wordIndexRules["Direct"]
Out[67]= {58,220,28,58,392,393,25,1,216,25,1}

In[68]:= toWords =toNumbers/. wordIndexRules["Inverse"]
Out[68]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[70]:= {ngrams["NGrams",2],ngrams["Frequencies",2]}//Short
Out[70]//Short= {{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}

The main idea here is that we use integer indices instead of words (strings), which allows us to utilize packed arrays for n-grams.
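
As a small illustration of why packed arrays matter here (the array below is random and the exact byte counts are machine-dependent), a packed integer array occupies several times less memory than its unpacked counterpart, and far less than the corresponding lists of strings:

pairs = RandomInteger[{1, 100000}, {1000000, 2}];    (* a packed array of one million integer pairs *)
{Developer`PackedArrayQ[pairs], ByteCount[pairs]}
ByteCount[Developer`FromPackedArray[pairs]]           (* the unpacked version is several times larger *)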

Compressing and saving takes another half a minute:

In[71]:= saveCompressed["C:\\Temp\\largeTextInfo.mx", inputlower, 
      wordIndexRules, ngrams] // Timing
Out[71]= {30.405, Null}

The resulting .mx file is about 63MB, which is about the size of the original file. Note that since part of what we save is the (self-compressed) variable inputlower, which contains all the input words in their original order, we lose no information compared to the original file. In principle, one can work only with the new .mx file from now on.

We now start a new session by quitting the kernel. It takes almost no time to load the file (.mx format is extremely efficient):

In[1]:= Get["C:\\Temp\\largeTextInfo.mx"] // Timing
Out[1]= {0.016, Null}

Loading the list of words takes some time (self-uncompressing):

In[2]:= inputlower//Short//Timing
Out[2]= {6.52,{la,présente,collection,numérisée,<<8000557>>,quicktime,3,0}}

but we don't use it for anything - it is stored just in case. Loading the 2-grams and their frequencies:

In[3]:= Timing[Short[ngrams2 = {ngrams["NGrams",2],ngrams["Frequencies",2]}]]
Out[3]= {0.639,{{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}}

Note that most of the time here was spent on self-uncompressing, which is selective (so, e.g., ngrams["NGrams",3] is still compressed). Loading the 3-grams and their frequencies:

In[4]:= Timing[Short[ngrams3 = {ngrams["NGrams",3],ngrams["Frequencies",3]}]]
Out[4]= {1.357,{{{11333,793,11334},{793,11334,11356},<<4642628>>,{18,21,22},{20,18,21}},{<<1>>}}}

The timings are decent, given the size of the lists. Note that neither DumpSave-Get nor Compress-Uncompress unpacks packed arrays, so our data is stored in Mathematica's memory pretty efficiently:

In[5]:= Developer`PackedArrayQ/@ngrams3
Out[5]= {True,True}

Here we uncompress the rules relating indices to words:

In[6]:= Timing[Short[wordIndexRules["Inverse"]]]
Out[6]= {0.905,Dispatch[{1->la,2->présente,<<160350>>,160353->7631,160354->jomac},-<<14>>-]}

This is enough to start working with the data, but in the next section I outline some hints on how to make this work yet more efficient.

If we try to find, for example, all positions of 3-grams with frequency 1, the naive way is this:

In[8]:= Position[ngrams["Frequencies",3],1,{1}]//Short//Timing
Out[8]= {1.404,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

However, we can leverage the fact that we work with integer indices (instead of words) stored in a packed array. Here is one version of the custom position function (due to Norbert Pozar):

extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]; 
positionExtr[x_List, n_] := 
    extractPositionFromSparseArray[SparseArray[Unitize[x - n], Automatic, 1]]

Using this, we get the result about 10 times faster (one could also use a function compiled to C, which is about twice as fast again; a possible sketch of such a compiled function follows the timing below):

In[9]:= positionExtr[ngrams["Frequencies",3],1]//Short//Timing
Out[9]= {0.156,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}
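
The compiled-to-C function is not shown in the answer; the following is one possible sketch (it needs a C compiler installed - otherwise drop CompilationTarget -> "C" - and it returns a flat list of positions rather than the nested form that Position gives):

positionExtrC = Compile[{{x, _Integer, 1}, {n, _Integer}},
    Module[{pos = Table[0, {Length[x]}], ctr = 0},
      Do[If[x[[i]] == n, ctr++; pos[[ctr]] = i], {i, Length[x]}];
      Take[pos, ctr]],
    CompilationTarget -> "C", RuntimeOptions -> "Speed"];

positionExtrC[ngrams["Frequencies", 3], 1] // Short // Timing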

Here are some more convenience functions:

Clear[getNGramsWithFrequency];
getNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=
  Extract[ngramSym["NGrams", n], positionExtr[ngramSym["Frequencies", n], freq]];

Clear[deleteNGramsWithFrequency];
deleteNGramsWithFrequency[{ngrams_List, freqs_List}, freq_Integer] :=  
    Delete[#, positionExtr[freqs, freq]] & /@ {ngrams, freqs};

deleteNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=  
   deleteNGramsWithFrequency[{ngramSym["NGrams", n], ngramSym["Frequencies", n]}, freq];

Using these, we can do many things pretty efficiently. For example, deleting the 2-grams with frequency 1:

In[15]:= deleteNGramsWithFrequency[ngrams,2,1]//Short//Timing
Out[15]= {0.218,{{{793,791},{25,1},{4704,791},<<696333>>,{29,66},{36,37},{18,21}},{<<1>>}}}

Or, deleting the 2-grams with frequencies of 100 or less, keeping only those that occur more than 100 times (this is a sub-optimal way to do it, but it is still pretty fast; a more direct, vectorized variant is sketched after the output below):

In[17]:= (twogramsLarger100 = 
  Fold[deleteNGramsWithFrequency,deleteNGramsWithFrequency[ngrams,2,1],Range[2,100]])
  //Short//Timing
Out[17]= {0.344,{{{793,791},{25,1},{4704,791},{25,10},<<6909>>,
  {31,623},{402,13},{234,25}},{<<1>>}}}
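
A more direct, vectorized variant could look like the sketch below (the function name keepNGramsAboveFrequency is not part of the original code); it builds a 0/1 mask over the packed frequency array and uses Pick, so only the n-grams occurring more than minFreq times are kept in a single pass:

Clear[keepNGramsAboveFrequency];
keepNGramsAboveFrequency[ngramSym_Symbol, n_Integer, minFreq_Integer] :=
  With[{mask = UnitStep[ngramSym["Frequencies", n] - minFreq - 1]},
    {Pick[ngramSym["NGrams", n], mask, 1], Pick[ngramSym["Frequencies", n], mask, 1]}];

keepNGramsAboveFrequency[ngrams, 2, 100] // Short // Timing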

The main idea is that the integer indices play the role of "pointers" to words, and most things can be done with them. When needed, we can convert back to the normal words:

In[18]:= twogramsLarger100/.wordIndexRules["Inverse"]//Short//Timing
Out[18]= {0.063,{{{of,the},{de,la},<<6912>>,{société,du},{processus,de}},{<<1>>}}}

Concluding remarks

The speed-up achieved here seems substantial. One can control how much RAM is occupied by the data by loading it in fine-grained chunks. The memory usage itself has been hugely optimized by using packed arrays. The memory savings on disk are due to the combination of Compress and DumpSave. Hash-tables, Dispatch-ed rules and self-uncompressing are techniques used to make it more convenient.

There is plenty of room here for further refinements. One can split the data into smaller chunks and compress/save those separately, to avoid high memory usage in the intermediate steps. One can also split the data according to frequency ranges and save the resulting pieces into separate files, to speed up the load/self-uncompress stage. For many files, one would need to generalize this, since global symbols were used for the hashes here. This seems a good place to apply some OOP techniques. Generally, this is just a starting point, but my message is that this approach IMO has good potential for efficient work with such files.
