Reading large text files with streams in C#


Question

I've got the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product for quick macros). Most files are about 300-400 KB, which load fine. But when they go beyond 100 MB the process has a hard time (as you'd expect).

What happens is that the file is read and shoved into a RichTextBox which is then navigated - don't worry too much about this part.

The developer who wrote the initial code is simply using a StreamReader and doing

[Reader].ReadToEnd()

which could take quite a while to complete.

My task is to break this bit of code up, read it in chunks into a buffer and show a progressbar with an option to cancel it.

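Some assumptions:

  • Most files will be 30-40 MB.
  • The contents of the file are text (not binary); some are in Unix format, some are DOS.
  • Once the contents are retrieved, we work out which line terminator is used.
  • No one is concerned about the time it takes to render in the RichTextBox once it's loaded. It's just the initial load of the text.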

Now for the questions:

  • Can I simply use StreamReader, then check the Length property (so ProgressMax), issue a Read for a set buffer size, and iterate in a while loop inside a background worker so it doesn't block the main UI thread? Then return the StringBuilder to the main thread once it's completed.
  • The contents will be going to a StringBuilder. Can I initialise the StringBuilder with the size of the stream if the length is available?

Are these (in your professional opinions) good ideas? I've had a few issues in the past with reading content from Streams, because it will always miss the last few bytes or something, but I'll ask another question if this is the case.

Recommended answer

You can improve read speed by using a BufferedStream, like this:

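// Requires: using System.IO;
// FileShare.ReadWrite lets other processes keep the file open while we read it.
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    // ReadLine returns null at end of file and handles both Unix and DOS line endings.
    while ((line = sr.ReadLine()) != null)
    {
        // Process each line here.
    }
}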
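The question also asks about reading in chunks with progress reporting and cancellation. A minimal sketch of that idea follows; it is not part of the original answer. The names `ChunkedTextLoader`, `Load`, and `chunkChars` are hypothetical, and pre-sizing the StringBuilder from the file length assumes an encoding such as ASCII, UTF-8, or UTF-16, where the character count never exceeds the byte count.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;

public static class ChunkedTextLoader
{
    // Hypothetical helper: reads a text file in fixed-size character chunks,
    // reports approximate progress (0-100), and checks for cancellation per chunk.
    public static string Load(string path, IProgress<int> progress,
                              CancellationToken token, int chunkChars = 64 * 1024)
    {
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (StreamReader sr = new StreamReader(fs))
        {
            long totalBytes = fs.Length;
            // Assumption: byte length is an upper bound on the char count,
            // so it is used as the initial capacity (clamped to a safe int).
            var sb = new StringBuilder((int)Math.Min(totalBytes, int.MaxValue / 2));
            char[] buffer = new char[chunkChars];
            int read;
            while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
            {
                token.ThrowIfCancellationRequested();
                // Append only the chars actually read; the last chunk is usually short.
                sb.Append(buffer, 0, read);
                // fs.Position runs slightly ahead because StreamReader reads ahead,
                // so this percentage is approximate.
                if (totalBytes > 0 && progress != null)
                    progress.Report((int)(fs.Position * 100 / totalBytes));
            }
            return sb.ToString();
        }
    }
}
```

One way to use it would be to call `Load` from a BackgroundWorker's DoWork handler or from `Task.Run`, passing a `Progress<int>` created on the UI thread so that reports are marshalled back to it. Using the count returned by `Read`, rather than assuming a full buffer, is what avoids losing or duplicating the tail of the file.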

Update, March 2013

I recently wrote code for reading and processing (searching for text in) 1 GB-ish text files (much larger than the files involved here) and achieved a significant performance gain by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.

I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding this pattern.
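The original code is not shown, so the following is only a sketch of the producer/consumer shape described above. It deliberately uses `BlockingCollection<string>` instead of TPL Dataflow so that it depends on nothing outside the base class library. `LineSearch`, `CountMatches`, and `boundedCapacity` are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class LineSearch
{
    // Hypothetical helper: counts the lines containing `term`, with one task
    // reading lines and another task searching them.
    public static int CountMatches(string path, string term, int boundedCapacity = 10000)
    {
        // The bound stops the reader from racing far ahead of the searcher.
        using (var lines = new BlockingCollection<string>(boundedCapacity))
        {
            Task producer = Task.Run(() =>
            {
                try
                {
                    using (var sr = new StreamReader(path))
                    {
                        string line;
                        while ((line = sr.ReadLine()) != null)
                            lines.Add(line);   // blocks while the queue is full
                    }
                }
                finally
                {
                    lines.CompleteAdding();    // always release the consumer
                }
            });

            Task<int> consumer = Task.Run(() =>
            {
                int matches = 0;
                // Ends once the producer has completed and the queue is drained.
                foreach (string line in lines.GetConsumingEnumerable())
                    if (line.Contains(term)) matches++;
                return matches;
            });

            Task.WaitAll(producer, consumer);
            return consumer.Result;
        }
    }
}
```

Whether this is faster than a single loop depends on how expensive the per-line work is compared with the read; for a trivial search the queue overhead may outweigh the gain, so it is worth measuring.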

Why BufferedStream is faster

A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer.
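To make the "fewer calls" point concrete, here is a small illustrative sketch. `CountingStream` and `BufferDemo` are hypothetical types written for this example: the sink discards data and counts how many Write calls reach it, standing in for a destination where every call is expensive.

```csharp
using System;
using System.IO;

// Hypothetical write-only stream that discards data and counts Write calls.
public sealed class CountingStream : Stream
{
    public int WriteCalls { get; private set; }
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { WriteCalls++; }
}

public static class BufferDemo
{
    // Writes `n` single bytes and returns how many Write calls reached the sink.
    public static int CountWrites(int n, bool buffered)
    {
        var sink = new CountingStream();
        Stream target = buffered ? new BufferedStream(sink) : (Stream)sink;
        for (int i = 0; i < n; i++)
            target.WriteByte((byte)'x');
        target.Flush();   // push any buffered bytes down to the sink
        return sink.WriteCalls;
    }
}
```

Calling `BufferDemo.CountWrites(1000, false)` sends every byte to the sink as its own call, while `BufferDemo.CountWrites(1000, true)` collects the bytes in memory and forwards them in far fewer calls. The same effect applies to reads: many small reads are served from the in-memory block instead of each going to the underlying stream.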

Update, December 2014: your mileage may vary

Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first provided, I measured a significant performance boost by adding a BufferedStream. At the time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.
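Since the result depends on runtime and platform, it may be simplest to measure on your own setup. A rough timing sketch follows; `ReadTiming` and `TimeRead` are hypothetical names, and this is not a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadTiming
{
    // Hypothetical helper: reads every line of the file and returns
    // the line count together with the elapsed time.
    public static Tuple<long, TimeSpan> TimeRead(string path, bool useBufferedStream)
    {
        var sw = Stopwatch.StartNew();
        long count = 0;
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (Stream source = useBufferedStream ? new BufferedStream(fs) : (Stream)fs)
        using (var sr = new StreamReader(source))
        {
            while (sr.ReadLine() != null) count++;
        }
        sw.Stop();
        return Tuple.Create(count, sw.Elapsed);
    }
}
```

Run each variant several times and discard the first pass, because the operating system's file cache will make later reads of the same file faster regardless of which stream wrapper is used.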

Related

I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.NET MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. For more, see "Unbuffered Output Very Slow".
