Start reading massive text file from the end

Problem description

I would like to ask whether you could suggest some alternatives for my problem.

Basically, I'm reading a .txt log file that averages 8 million lines, around 600 MB of pure raw text.

I'm currently using StreamReader to make 2 passes over those 8 million lines, sorting and filtering the important parts of the log file, but my computer takes ~50 seconds to do 1 complete run.

One way I could optimize this is to have the first pass start reading from the end, because the most important data is located in approximately the final 200k lines. Unfortunately, from what I found when searching, StreamReader can't do this. Any ideas on how to do it?

Some general constraints:

  • The number of lines varies
  • The file size varies
  • The location of the important data varies, but it is roughly within the last 200k lines

Here's the loop code for the first pass over the log file, just to give you an idea:

Do Until sr.EndOfStream = True  'Read whole file
    Dim streambuff As String = sr.ReadLine  'Line currently being read
    Dim CombatLogNames() As String  'Array to store CombatLogNames
    Dim searcher As String

    If streambuff.Contains("CombatLogNames flags:0x1") Then  'Keyword to filter CombatLogNames packets in the .txt

        Dim check As String = streambuff  'Duplicate of the line being read
        'index1 and index2 are used to bypass the first CombatLogNames packet that contains only 1 entry
        Dim index1 As Char = check.Substring(check.IndexOf("(") + 1)
        Dim index2 As Char = check.Substring(check.IndexOf("(") + 2)

        If (check.IndexOf("(") <> -1 And index1 <> "" And index2 <> " ") Then  'Stricter filters for CombatLogNames

            Dim endCLN As Integer = 0  'Signifies the end of the CombatLogNames packet
            Dim x As Integer = 0  'Counter for array

            While (endCLN = 0 And streambuff <> "---- CNETMsg_Tick")  'Loops until the end keyword for CombatLogNames is seen

                streambuff = sr.ReadLine  'Reads a new line to flush out "CombatLogNames flags:0x1" which is unneeded
                If ((streambuff.Contains("---- CNETMsg_Tick") = True) Or (streambuff.Contains("ResponseKeys flags:0x0 ") = True)) Then

                    endCLN = 1  'Value change to determine end of CombatLogNames packet

                Else

                    ReDim Preserve CombatLogNames(x)  'Resizes the array while preserving the values
                    searcher = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
                    streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'"))  'Additional filtering to get only valuable data
                    CombatLogNames(x) = search(searcher)
                    x += 1  '+1 to array counter

                End If
            End While
        Else
            'MsgBox("Something went wrong, Flame the coder of this program!!")  'Bug testing code that is disabled
        End If
    Else
    End If

    If (sr.EndOfStream = True) Then

        ReDim GlobalArr(CombatLogNames.Length - 1)  'Resizing the global array to prime it for copying data
        Array.Copy(CombatLogNames, GlobalArr, CombatLogNames.Length)  'Just copying the array to make it global

    End If
Loop

Recommended answer

You CAN set the BaseStream to the desired reading position; you just can't set it to a specific LINE (because counting lines requires reading the complete file).

    Using sw As New StreamWriter("foo.txt", False, System.Text.Encoding.ASCII)
        For i = 1 To 100
            sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
        Next

    End Using
    Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
        sr.BaseStream.Seek(-100, SeekOrigin.End)
        Dim garbage = sr.ReadLine ' can not use, because very likely not a COMPLETE line
        While Not sr.EndOfStream
            Dim line = sr.ReadLine
            Console.WriteLine(line)
        End While
    End Using
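
Since the question wants roughly the last 200k lines rather than a fixed byte count, one possible way to pick the offset is to estimate it from the average line length. The sketch below is only an illustration under stated assumptions: `ReadTail` and `TailReader` are hypothetical names, the file is assumed to use a single-byte encoding such as ASCII, and the 75 bytes per line figure is just 600 MB divided by 8 million lines from the question.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module TailReader
    ' Hypothetical helper: returns the complete lines found in roughly the last
    ' approxBytes bytes of the file. Assumes a single-byte encoding (ASCII).
    Function ReadTail(filePath As String, approxBytes As Long) As List(Of String)
        Dim lines As New List(Of String)
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
            ' Clamp to 0 so that small files are simply read from the start.
            Dim start As Long = Math.Max(0L, fs.Length - approxBytes)
            ' Seek before the reader exists, so nothing is buffered yet.
            fs.Seek(start, SeekOrigin.Begin)
            Using sr As New StreamReader(fs, Encoding.ASCII)
                ' Unless we start at byte 0, the first line is probably cut off.
                If start > 0 Then sr.ReadLine()
                Dim line As String = sr.ReadLine()
                While line IsNot Nothing
                    lines.Add(line)
                    line = sr.ReadLine()
                End While
            End Using
        End Using
        Return lines
    End Function

    Sub Main()
        ' Small demo file, same content as in the answer above.
        Using sw As New StreamWriter("foo.txt", False, Encoding.ASCII)
            For i = 1 To 100
                sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
            Next
        End Using

        ' For the real log: 600 MB / 8 million lines is about 75 bytes per line,
        ' so 200k lines is about 15 MB. Doubling it (2L * 200000L * 75L) would
        ' leave headroom for longer lines. The demo file only needs 100 bytes.
        Dim tail As List(Of String) = ReadTail("foo.txt", 100)
        Console.WriteLine(tail.Count)
    End Sub
End Module
```

Because the estimate is only approximate, a caller could check whether the expected keyword shows up in the returned lines and retry with a larger `approxBytes` if it does not. Note that when the offset happens to land exactly on a line boundary, this sketch still discards that first (complete) line.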

For any later read of the same file, you could simply save the final position (of the base stream) and, on the next read, advance to that position before you start reading lines.
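
The idea of saving the final position could be sketched roughly as follows. `ReadFrom` and `ResumeReader` are hypothetical names, and the sketch assumes an ASCII log that is only ever appended to and whose writer always finishes each line with a newline; a file that was truncated or rotated would need the saved position reset to 0.

```vb
Imports System
Imports System.IO
Imports System.Text

Module ResumeReader
    ' Hypothetical helper: prints every line from startPos to the end of the
    ' file and returns the position to resume from on the next call.
    Function ReadFrom(filePath As String, startPos As Long) As Long
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
            ' Seek before creating the reader so nothing is buffered yet.
            fs.Seek(Math.Min(startPos, fs.Length), SeekOrigin.Begin)
            Using sr As New StreamReader(fs, Encoding.ASCII)
                Dim line As String = sr.ReadLine()
                While line IsNot Nothing
                    Console.WriteLine(line)
                    line = sr.ReadLine()
                End While
                ' The reader has consumed the stream up to its end, so the base
                ' stream position marks where the next read should start.
                Return fs.Position
            End Using
        End Using
    End Function

    Sub Main()
        File.WriteAllText("resume.txt", "first" & Environment.NewLine & "second" & Environment.NewLine)
        Dim saved As Long = ReadFrom("resume.txt", 0)      ' reads both lines

        File.AppendAllText("resume.txt", "third" & Environment.NewLine)
        saved = ReadFrom("resume.txt", saved)              ' reads only the new line
    End Sub
End Module
```

The position is taken only after the reader has reached the end of the stream. In the middle of a read, `BaseStream.Position` runs ahead of the last line returned because `StreamReader` buffers its input, so it would not be a reliable resume point there.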
