Ultra-large text file parsing (size larger than 100 GB)


Problem description


Hey guys, help!!!

Recently I did a genome annotation job for a bacterial genome, and one of the annotation results is an NCBI blastp output log text file. It is about 100 GB in size (and may grow larger later). I had previously written a class for parsing blastp output files, but the files I handled with it before were all under 2 GB.

Now, facing a text file larger than 100 GB, the .NET Regular Expression and StringBuilder classes can no longer handle such a huge string: they throw an OutOfMemoryException on our 1 TB-memory Linux server!

Who can help me? I know this is crazy...

Here is the loading code; based on the exception information, the program gets stuck here:

Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

Dim SourceText As String = IO.File.ReadAllText(LogFile) 
Dim Sections As String() = (From matche As Match
                            In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                            Select matche.Value).ToArray

The blast output file consists of many, many sections like the one below. I think it is easy to parse with a regular expression, and the regular expression keeps the code clean:

blablablabla.........

Query= XC_0118 transcriptional regulator

Length=1113
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  lcl5167|ana:all4503 two-component response regulator; K07657 tw...  57.4    2e-009
  lcl2658|cyp:PCC8801_3460 winged helix family two component tran...  56.6    4e-009
  lcl8962|tnp:Tnap_0756 two component transcriptional regulator, ...  55.1    7e-009
  lcl9057|kol:Kole_0706 two component transcriptional regulator, ...  55.1    1e-008
  lcl9114|ter:Tery_2902 two component transcriptional regulator; ...  55.1    1e-008
  lcl9051|trq:TRQ2_0821 two component transcriptional regulator (...  54.7    1e-008
  lcl9023|tpt:Tpet_0798 two component transcriptional regulator (...  54.7    1e-008
  lcl8929|tma:TM0126 response regulator; K02483 two-component sys...  54.3    1e-008
  lcl8992|tna:CTN_0563 Response regulator (A)|[Regulog=ZnuR - The...  52.4    6e-008


blablablabla.........


> lcl5167|ana:all4503 two-component response regulator; K07657 
two-component system, OmpR family, phosphate regulon response 
regulator PhoB (A)|[Regulog=SphR - Cyanobacteria] [tfbs=all3651:-119;alr5291:-229;all4021:-136;alr5259:-329;all1758:-49;all0129:-101;all3822:-149;alr4975:-10;alr2234:-275;all0207:-98;all0911:-105;all4575:-324]
Length=253

 Score = 57.4 bits (137),  Expect = 2e-009, Method: Compositional matrix adjust.
 Identities = 36/109 (33%), Positives = 52/109 (48%), Gaps = 1/109 (1%)

Query  3    LRSERVTQLGSVPRFRLGPLLVEPERLMLIGDGERITLEPRMMEVLVALAERAGEVISAE  62
            LR +R+  L  +P  +   + + P+   ++  G+ + L P+   +L      A  V S E
Sbjct  144  LRRQRLITLPQLPVLKFKDVTLNPQECRVLVRGQEVNLSPKEFRLLELFMSYARRVWSRE  203

Query  63   QLLIDVWHGSFYGDNP-VHKTIAQLRRKLGDDSRQPRFIETIRKRGYRL  110
            QLL  VW   F GD+  V   I  LR KL  D   P +I T+R  GYR 
Sbjct  204  QLLDQVWGPDFVGDSKTVDVHIRWLREKLEQDPSHPEYIVTVRGFGYRF  252


blablablabla.........


Lambda      K        H        a         alpha
   0.321    0.133    0.395    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6 

Effective search space used: 1655396995

blablablabla.........

Recommended answer

From your regex pattern

const string BLAST_QUERY_HIT_SECTION = @"Query=.+?Effective search space used: \d+";

(Yes, I converted it to C#; note the verbatim string, since \d is not a valid escape in an ordinary C# string literal.)

I'd guess that your file format is something like:

possibly some stuff in the line ahead of Query=something query identifier-ish
lots (multiple? lines) of query result stuff ...
possibly some stuff in the line ahead of Effective search space used: some digits





So why not write a fairly simple loop that looks at the blast log file one line at a time and assembles the query results piece by piece?

public static IEnumerable<string> BlastQueryHits(string logFile)
{
  bool inHit = false;
  StringBuilder sb = new StringBuilder();
  foreach (string line in File.ReadLines(logFile))  // streams the file; never loads it all at once
  {
    if (inHit)
    {
      sb.AppendLine(line);  // AppendLine restores the line breaks that ReadLines strips
      if (line.StartsWith("Effective search space used:"))  // from discussion comment below, this is more efficient
      {
        yield return sb.ToString();
        inHit = false;
      }
    }
    else if (line.StartsWith("Query="))  // from discussion comment below, this is more efficient
    {
      sb.Clear().AppendLine(line);
      inHit = true;
    }
  }
  if (inHit)
     yield return sb.ToString();   // just in case there are any "leftovers"
}





This will return each of the query hits as a string.
They are assembled as needed, so there is no need to have everything in memory at the same time.
Work with them one at a time!

Caveat: this code is off the top of my head... I didn't compile or execute it. It might have an issue or two, but it should be pretty close! ;-)

Edit: revised based on the comments below.

The way this could be used in a parallel-processing scenario would be something like:

string logFile = "path to blast log file";
Parallel.ForEach(BlastQueryHits(logFile), DoSomethingWithOneQuery);

private void DoSomethingWithOneQuery(string queryHit)
{
  // do the per-query-hit processing
}





This will still produce the query hits only as needed. So if it parallelizes over 3 cores, it will produce the first 3 right away and then the next ones as the previous ones complete. It still never produces them all at once and leaves them sitting in memory until the processing gets to them, so it should have a substantially lower memory footprint.
The BlastQueryHits() method is equivalent to your Regex.Matches call, but it is much more efficient and works with an arbitrarily large file (as long as each chunk is smaller than the 2 GB limit for strings).


Even if the chunks are not of a predefined size, you should find a way to split the file into chunks, or at least to get the position of each chunk boundary.
Still, if this is a lab application, I would not use C#. I would use Python, for example; it has more sophisticated text-manipulation techniques available.

On the other hand, whatever you can do by loading everything into memory, you can also do by jumping around in the file. It won't be quick, but it will work. In your case, matching the pattern "Query=.+?Effective search space used: \d+" is not complex at all; you only need a simple pattern-matching algorithm. Matt T Heffron's answer contains one approach, but it might not be "low-level" enough in some cases.
Even better, your pattern is a "type 2 language" statement, so you can look for it with a simple state-machine implementation. Consider the file as a single array of characters; seen this way, you don't need to load it into memory. To speed it up, you can load fixed-size pages at an offset, or try MemoryMappedFile, since this is exactly what it was made for.
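To illustrate the memory-mapped idea, here is a minimal sketch in Python, with `mmap` standing in for .NET's MemoryMappedFile. The function name `iter_blast_sections` is made up for this example; the pattern is the one from the question:

```python
import mmap
import re

def iter_blast_sections(path):
    """Yield each "Query= ... Effective search space used: N" section.

    The file is memory-mapped, so the OS pages it in on demand and the
    whole 100 GB file is never loaded into memory at once."""
    pattern = re.compile(rb"Query=.+?Effective search space used: \d+", re.DOTALL)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in pattern.finditer(mm):
                yield match.group(0).decode("utf-8", errors="replace")
```

Python's re engine accepts any bytes-like buffer, so it can scan the mapping directly without first copying it into a string.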



Hey guys, I have worked out how to deal with this ultra-large text file parsing job. It takes 3 steps:

1. Load all of the data into memory, split into 786 MB chunks (it seems the UTF8.GetString function cannot handle sizes larger than 1 GB), and cache the chunks in a list.
2. Use the regular expression to parse the sections. The regex matching function uses only a single thread, so Parallel LINQ can speed this job up.
3. Do the per-section text parsing job as I did before.

Here is my code:
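The tricky part of step 1 is a section that straddles a chunk boundary. The carry-over idea (keep everything after the last complete section marker and prepend it to the next chunk) can be sketched in Python as follows; `read_in_chunks_with_carry`, the marker default, and the chunk size are illustrative, not part of the original code:

```python
import re

def read_in_chunks_with_carry(path, marker="Effective search space used:",
                              chunk_size=1024 * 1024):
    """Yield text blocks that each end on a complete section-marker line.

    Text after the last complete marker is carried over and prepended to
    the next chunk, so a section split across a boundary is never lost."""
    carry = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = carry + chunk
            idx = text.rfind(marker)
            if idx == -1:
                carry = text  # no complete section yet; keep accumulating
                continue
            end = text.find("\n", idx)
            if end == -1:
                carry = text  # marker line itself is still incomplete
                continue
            yield text[:end]       # everything up to the end of the marker line
            carry = text[end:]
    if carry.strip():
        yield carry                # leftovers, e.g. a truncated final section
```

Each yielded block can then be handed to the section regex (or to a parallel worker), just as the cached chunk list is used in the code.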

''' <summary>
''' For files smaller than 1 GB.
''' </summary>
''' <param name="LogFile">Original plain text file path of the blast output file.</param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function TryParse(LogFile As String) As v228
   Call Console.WriteLine("Regular Expression parsing blast output...")

   Using p As Microsoft.VisualBasic.ConsoleProcessBar = New ConsoleProcessBar
      Call p.Start()

      Dim SourceText As String = IO.File.ReadAllText(LogFile) 'LogFile.ReadUltraLargeTextFile(System.Text.Encoding.UTF8)
      Dim Sections As String() = (From matche As Match
                                  In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                                  Select matche.Value).ToArray

      Call Console.WriteLine("Parsing job done!")

      Dim Sw As Stopwatch = Stopwatch.StartNew
#If DEBUG Then
      Dim LQuery = (From Line As String In Sections Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#Else
      Dim LQuery = (From Line As String In Sections.AsParallel Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#End If
      Dim BLASTOutput As v228 = New v228 With {.FilePath = LogFile & ".xml", .Queries = LQuery}
      Console.WriteLine("BLASTOutput file loaded: {0}ms", Sw.ElapsedMilliseconds)

      Return BLASTOutput
   End Using
End Function

''' <summary>
''' It seems that 786 MB may be the upper bound of the UTF8.GetString function.
''' </summary>
''' <remarks></remarks>
Const CHUNK_SIZE As Long = 1024 * 1024 * 786
Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

''' <summary>
''' Dealing with the file size large than 2GB
''' </summary>
''' <param name="LogFile"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function TryParseUltraLarge(LogFile As String, Optional Encoding As System.Text.Encoding = Nothing) As v228
    Call Console.WriteLine("Regular Expression parsing blast output...")

    'The default text encoding of the blast log is utf8
    If Encoding Is Nothing Then Encoding = System.Text.Encoding.UTF8

    Using p As Microsoft.VisualBasic.ConsoleProcessBar = New ConsoleProcessBar
       Call p.Start()

       Dim TextReader As IO.FileStream = New IO.FileStream(LogFile, IO.FileMode.Open)
       Dim ChunkBuffer As Byte() = New Byte(CHUNK_SIZE - 1) {}
       Dim LastIndex As String = ""
       'Dim Sections As List(Of String) = New List(Of String)
       Dim SectionChunkBuffer As List(Of String) = New List(Of String)

       Do While TextReader.Position < TextReader.Length
          Dim Delta As Long = TextReader.Length - TextReader.Position 'Long: the difference can overflow an Integer on a 100 GB file

          If Delta < CHUNK_SIZE Then ChunkBuffer = New Byte(CInt(Delta) - 1) {}

          Call TextReader.Read(ChunkBuffer, 0, ChunkBuffer.Length) 'Read the whole buffer; Count - 1 would drop the last byte

          Dim SourceText As String = Encoding.GetString(ChunkBuffer)

          If Not String.IsNullOrEmpty(LastIndex) Then
             SourceText = LastIndex & SourceText
          End If

          Dim i_LastIndex As Integer = InStrRev(SourceText, "Effective search space used:")
          If i_LastIndex = 0 Then  'InStrRev returns 0 when not found: no complete section in this chunk
             LastIndex &= SourceText
             Continue Do
          Else
             i_LastIndex += 42  'Skip past the marker text and the digits that follow it

             If Not i_LastIndex >= Len(SourceText) Then
                LastIndex = Mid(SourceText, i_LastIndex)  'There are some text in the last of this chunk is the part of the section in the next chunk.
             Else
                LastIndex = ""
             End If
             Call SectionChunkBuffer.Add(SourceText)
          End If

          'This part of the code is non-parallel

          'Dim SectionsTempChunk = (From matche As Match
          '                         In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
          '                         Select matche.Value).ToArray

          'If SectionsTempChunk.IsNullOrEmpty Then
          '    LastIndex &= SourceText
          '    Continue Do
          'Else
          '    Call Sections.AddRange(SectionsTempChunk)
          'End If

          'LastIndex = SectionsTempChunk.Last()
          'Dim Last_idx As Integer = InStr(SourceText, LastIndex) + Len(LastIndex) + 1
          'If Not Last_idx >= Len(SourceText) Then
          '    LastIndex = Mid(SourceText, Last_idx)  'There are some text in the last of this chunk is the part of the section in the next chunk.
          'Else
          '    LastIndex = ""
          'End If
      Loop

      Call Console.WriteLine("Loading job done, start to regex parsing!")

       'Regex parsing runs on a single thread; applying it to the cached chunks with Parallel LINQ speeds up the job for an ultra-large text file.
       Dim Sections As String() = (From strLine As String 
                                   In SectionChunkBuffer.AsParallel
                                   Select (From matche As Match
                                           In Regex.Matches(strLine, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
                                           Select matche.Value).ToArray).ToArray.MatrixToVector

       Call Console.WriteLine("Parsing job done!")
       '#Const DEBUG = 1
       Dim Sw As Stopwatch = Stopwatch.StartNew
#If DEBUG Then
       Dim LQuery = (From Line As String In Sections Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#Else
       Dim LQuery = (From Line As String In Sections.AsParallel Let query = v228.Query.TryParse(Line) Select query Order By query.QueryName Ascending).ToArray
#End If
       Dim BLASTOutput As v228 = New v228 With {.FilePath = LogFile & ".xml", .Queries = LQuery}
       Console.WriteLine("BLASTOutput file loaded: {0}ms", Sw.ElapsedMilliseconds)

       Return BLASTOutput
   End Using
End Function

