Reading a large text file is very slow

Question

I have been given the task of writing a VB program that reads a large .txt file (anywhere from 500 MB to 2 GB). Each line usually starts with a 13-digit number, followed by a lot of other information (e.g. "1578597500548 info info info info etc."). The user enters a 13-digit number, and my program must search the file for that number at the beginning of each line and, where it is found, write the full line to a new .txt file.

My current program works, but I've noticed that the StreamReader/adding-to-the-list part takes up around 90% of the processing time, averaging around 27 seconds per run. Any ideas on how to speed it up? Here's what I have written.

Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    Dim wtr As IO.StreamWriter
    Dim listy As New List(Of String)
    Dim i = 0

    stpw.Reset()
    stpw.Start()

    'reading in file of large data 700mb and larger
    Using Reader As New StreamReader("G:\USER\FOLDER\tester.txt")
        While Reader.EndOfStream = False
            listy.Add(Reader.ReadLine)
        End While
    End Using

    'have a textbox which finds user query number
    Dim result = From n In listy
                 Where n.StartsWith(TextBox1.Text)
                 Select n

    'writes results found into new file
    wtr = New StreamWriter("G:\USER\searched-number.txt")
    For Each word As String In result
        wtr.WriteLine(word)
    Next
    wtr.Close()

    stpw.Stop()
    Debug.WriteLine(stpw.Elapsed.TotalMilliseconds)

    Application.Exit()
End Sub

UPDATE: I've taken the suggestion not to put everything into a list first and to search each line as it is read instead. That is about 5 seconds faster, but it still takes 23 seconds to complete, and it is also writing out the line above the one containing the number I'm searching for. Could you please tell me where I'm going wrong? Thanks, guys!

wtr = New StreamWriter("G:\Karl\searchednumber.txt")
        Using Reader As New StreamReader("G:\Karl\AC\tester.txt")
            While Reader.EndOfStream = False
                lineIn = Reader.ReadLine
                If Reader.ReadLine.StartsWith(TextBox1.Text) Then
                    wtr.WriteLine(lineIn)

                Else

                    Continue While
                End If
            End While
            wtr.Close()
        End Using
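
On the "line above" symptom in the snippet above: ReadLine is called twice per iteration, so the line that gets tested is not the line that gets written, and only every second line is tested at all. A minimal sketch of a single-pass filter that reads each line exactly once follows. The module and function names (LineFilter, CopyMatchingLines) are made up for illustration, and the ordinal comparison is an assumption that the prefix is plain digits; it is typically cheaper than the culture-sensitive default.

```vbnet
Imports System
Imports System.IO

Module LineFilter
    ' Copies every line of inputPath that starts with prefix to outputPath.
    ' Returns the number of lines written.
    Public Function CopyMatchingLines(inputPath As String, outputPath As String, prefix As String) As Integer
        Dim matches As Integer = 0
        Using reader As New StreamReader(inputPath)
            Using writer As New StreamWriter(outputPath)
                ' Read once per iteration and reuse the same variable for the test and the write.
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    If line.StartsWith(prefix, StringComparison.Ordinal) Then
                        writer.WriteLine(line)
                        matches += 1
                    End If
                    line = reader.ReadLine()
                End While
            End Using
        End Using
        Return matches
    End Function
End Module
```

From the button handler this could be called as CopyMatchingLines("G:\Karl\AC\tester.txt", "G:\Karl\searchednumber.txt", TextBox1.Text). It still scans the whole file on every search, so it fixes the wrong-line output rather than the run time.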

Recommended answer

Index the file when the program loads.

Create a Dictionary(Of ULong, Long), and read through the file when the program loads. For each line, add an entry to the dictionary with the 13-digit value at the front of the line as the ULong key and the line's position in the file stream as the Long value.

Then, when a user puts in a key, you can check the dictionary, which will be almost instant, to find the exact location on disk you need and seek there directly.

Building the file index at program start may take a few moments, but you'll only ever have to do it once. Right now, you either need to search through the entire thing every time a user wants to do a search, or keep several hundred megabytes of text file data in memory. Once you have the index, looking up a value in the dictionary and then seeking directly to it should appear to happen almost instantly.

I just saw this comment:

"there could be more than 1 occurrence of a 13 digit number so must search the whole file."

Based on that, the index should be a Dictionary(Of ULong, List(Of Long)), where adding a value to an entry first creates a list instance if one doesn't already exist, then adds the new value to the list.

Here's a basic attempt, typed directly into the reply window without the aid of test data or Visual Studio, so it likely still contains several bugs:

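Imports System.Collections.Generic
Imports System.IO

Public Class MyFileIndexer
    Private Const KeyLength As Integer = 13
    Private initialCapacity As Integer = 1
    Private Property FilePath As String
    Private Index As Dictionary(Of ULong, List(Of Long))

    Public Sub New(filePath As String)
        Me.FilePath = filePath
        RebuildIndex()
    End Sub

    'Scans the raw bytes so every recorded offset is a true byte position.
    'StreamReader reads ahead in blocks, so sr.BaseStream.Position does not
    'point at the start of the next line and cannot be used for this.
    'Assumes LF or CRLF line endings, ASCII digits in the key, and no byte-order mark.
    'Lines that do not start with 13 digits are skipped.
    Public Sub RebuildIndex()
        Index = New Dictionary(Of ULong, List(Of Long))()

        Using fs As New FileStream(FilePath, FileMode.Open, FileAccess.Read)
            Dim chunk(65535) As Byte
            Dim chunkStart As Long = 0   'file offset of chunk(0)
            Dim lineStart As Long = 0    'file offset of the current line
            Dim column As Integer = 0    'key bytes consumed on the current line
            Dim key As ULong = 0
            Dim isKey As Boolean = True  'False once a non-digit shows up in the key columns

            Dim count As Integer = fs.Read(chunk, 0, chunk.Length)
            While count > 0
                For i As Integer = 0 To count - 1
                    Dim b As Byte = chunk(i)
                    If b = 10 Then
                        'End of line: index it if it began with 13 digits
                        If isKey AndAlso column >= KeyLength Then AddPosition(key, lineStart)
                        lineStart = chunkStart + i + 1
                        column = 0
                        key = 0
                        isKey = True
                    ElseIf column < KeyLength Then
                        If b >= 48 AndAlso b <= 57 Then
                            key = key * 10UL + CULng(b - 48)
                        Else
                            isKey = False
                        End If
                        column += 1
                    End If
                Next
                chunkStart += count
                count = fs.Read(chunk, 0, chunk.Length)
            End While

            'Last line when the file does not end with a newline
            If isKey AndAlso column >= KeyLength Then AddPosition(key, lineStart)
        End Using
    End Sub

    'Creates the list for a key on first use, then records the line offset
    Private Sub AddPosition(key As ULong, position As Long)
        Dim item As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, item) Then
            item = New List(Of Long)(initialCapacity)
            Index.Add(key, item)
        End If
        item.Add(position)
    End Sub

    'Expect key to be a 13-character numeric string
    Public Function Search(key As String) As List(Of String)
        'Will throw an exception if parsing fails. Be prepared for that.
        Dim realKey As ULong = ULong.Parse(key)
        Return Search(realKey)
    End Function

    'Returns Nothing when the key is not in the index
    Public Function Search(key As ULong) As List(Of String)
        Dim positions As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, positions) Then Return Nothing

        Dim result As New List(Of String)()
        Using sr As New StreamReader(FilePath)
            For Each position As Long In positions
                sr.BaseStream.Seek(position, SeekOrigin.Begin)
                'Drop whatever the reader buffered before the seek
                sr.DiscardBufferedData()
                result.Add(sr.ReadLine())
            Next
        End Using
        Return result
    End Function
End Class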

'Somewhere public, when your application starts up:
Public Index As New MyFileIndexer("G:\USER\FOLDER\tester.txt")

Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    Dim lines As List(Of String) = Nothing
    Try
        lines = Index.Search(TextBox1.Text)
    Catch
        'Do something here
    End Try

    If lines IsNot Nothing Then
        Using sw As New StreamWriter($"G:\USER\{TextBox1.Text}.txt")
            For Each line As String in lines
                 sw.WriteLine(line)
            Next 
        End Using
    End If
End Sub

And for fun, here's a generic version of the class that lets you supply your own key selector function, so you can index any file that stores a key on each line. I could see that being generally useful for, say, larger CSV data sets.

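Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Public Class MyFileIndexer(Of TKey)
    Private initialCapacity As Integer = 1
    Private Property FilePath As String
    Private Index As Dictionary(Of TKey, List(Of Long))
    Private GetKey As Func(Of String, TKey)

    Public Sub New(filePath As String, keySelector As Func(Of String, TKey))
        Me.FilePath = filePath
        Me.GetKey = keySelector
        RebuildIndex()
    End Sub

    'Splits the raw bytes on LF so each recorded offset is a true byte position.
    'StreamReader reads ahead in blocks, so sr.BaseStream.Position cannot be used here.
    'Assumes UTF-8 or ASCII text with LF or CRLF line endings and no byte-order mark.
    Public Sub RebuildIndex()
        Index = New Dictionary(Of TKey, List(Of Long))()

        Using fs As New FileStream(FilePath, FileMode.Open, FileAccess.Read)
            Dim chunk(65535) As Byte
            Dim pending As New MemoryStream()  'bytes of the line being assembled
            Dim chunkStart As Long = 0          'file offset of chunk(0)
            Dim lineStart As Long = 0           'file offset of the current line

            Dim count As Integer = fs.Read(chunk, 0, chunk.Length)
            While count > 0
                Dim segment As Integer = 0      'start of the bytes in chunk not yet copied
                For i As Integer = 0 To count - 1
                    If chunk(i) = 10 Then
                        pending.Write(chunk, segment, i - segment)
                        AddLine(pending, lineStart)
                        segment = i + 1
                        lineStart = chunkStart + i + 1
                    End If
                Next
                'Carry the unfinished tail of this chunk over to the next read
                pending.Write(chunk, segment, count - segment)
                chunkStart += count
                count = fs.Read(chunk, 0, chunk.Length)
            End While

            'Last line when the file does not end with a newline
            If pending.Length > 0 Then AddLine(pending, lineStart)
        End Using
    End Sub

    'Decodes the assembled line, indexes it, and clears the buffer for the next line
    Private Sub AddLine(pending As MemoryStream, position As Long)
        Dim line As String = Encoding.UTF8.GetString(pending.GetBuffer(), 0, CInt(pending.Length)).TrimEnd(ControlChars.Cr)
        pending.SetLength(0)

        Dim key As TKey = GetKey(line)
        Dim item As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, item) Then
            item = New List(Of Long)(initialCapacity)
            Index.Add(key, item)
        End If
        item.Add(position)
    End Sub

    'Returns Nothing when the key is not in the index
    Public Function Search(key As TKey) As List(Of String)
        Dim positions As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, positions) Then Return Nothing

        Dim result As New List(Of String)()
        Using sr As New StreamReader(FilePath)
            For Each position As Long In positions
                sr.BaseStream.Seek(position, SeekOrigin.Begin)
                'Drop whatever the reader buffered before the seek
                sr.DiscardBufferedData()
                result.Add(sr.ReadLine())
            Next
        End Using
        Return result
    End Function
End Class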
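
As a rough usage sketch for the generic class, here is how a CSV file might be indexed by its first column. It relies on the MyFileIndexer(Of TKey) class above and is not standalone. The file path, the module name and the FirstColumn helper are all hypothetical, and the comma split is deliberately naive (it does not handle quoted fields).

```vbnet
Imports System
Imports System.Collections.Generic

Module GenericIndexExample
    ' Hypothetical key selector: everything before the first comma, or the whole line.
    Private Function FirstColumn(line As String) As String
        Dim comma As Integer = line.IndexOf(","c)
        Return If(comma < 0, line, line.Substring(0, comma))
    End Function

    Sub Main()
        ' Illustrative path only; builds the index once, up front.
        Dim index As New MyFileIndexer(Of String)("G:\USER\FOLDER\data.csv", AddressOf FirstColumn)

        ' Search returns Nothing when the key was never seen.
        Dim rows As List(Of String) = index.Search("1578597500548")
        If rows IsNot Nothing Then
            For Each row As String In rows
                Console.WriteLine(row)
            Next
        End If
    End Sub
End Module
```

Using String keys costs more memory than the ULong keys of the first version, so for the 13-digit case a selector that parses the prefix into a ULong would likely keep the index smaller.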
