What is the fastest way to filter certain lines from a text file
Problem description
I am new to C#.
I need to analyze a log file containing 500,000+ lines.
I need to filter the lines containing a specific keyword and store them in memory for further processing.
The lines have a fixed layout, so the keyword will be at the same position in every line.
What is the fastest way to do this in C#?
I have done something similar with TextFieldParser in Visual Basic, but it takes a long time, and I wonder whether there is a faster way.
Recommended answer
You probably got your answers already, but I might have found an even faster method. I have to admit, I had only 18 MB of test data with around 225k lines. Still, it might be worth a try. I made a small test comparing File.ReadLines with my MemoryMappedFile-based approach.
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Collections.Generic;
using System.Text;

namespace MMText
{
    public class MemoryMappedTextFileReader:IDisposable
    {
        MemoryMappedFile memoryMappedFile;

        public MemoryMappedTextFileReader(string fileName)
        {
            memoryMappedFile = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open);
        }

        public IEnumerable<string> ReadLines()
        {
            using (var memoryMappedViewStream = memoryMappedFile.CreateViewStream())
            {
                using (StreamReader sr = new StreamReader(memoryMappedViewStream, UTF8Encoding.UTF8, true, 4096)) {
                    while (!sr.EndOfStream) {
                        String line = sr.ReadLine();
                        yield return line;
                    }
                }
            }
        }

        #region IDisposable implementation
        bool disposed = false;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (disposed)
                return;
            if (disposing) {
                memoryMappedFile.Dispose();
            }
            disposed = true;
        }
        #endregion
    }
}
And the test:
using System;
using System.IO;
using System.Diagnostics;

namespace MMText
{
    class Program
    {
        public static void Main(string[] args)
        {
            long lines = 0;
            const string fileName = @"D:\TEMP\setupapi.dev.20140929_185959.log";

            var watch = Stopwatch.StartNew();
            foreach (var s in File.ReadLines(fileName))
            {
                lines++;
            }
            watch.Stop();
            TimeSpan ts = watch.Elapsed;
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
            Console.WriteLine("ReadLines - Reading {0} lines took: {1}. Average: {2} ms/line", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines);

            lines = 0;
            watch = Stopwatch.StartNew();
            using(var x = new MemoryMappedTextFileReader(fileName))
            {
                foreach(var s in x.ReadLines())
                {
                    lines++;
                }
            }
            watch.Stop();
            ts = watch.Elapsed;
            elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
            Console.WriteLine("MMF - Reading {0} lines took: {1}. Average: {2} ms/line", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines);

            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }
    }
}
Here are the results:
ReadLines - Reading 225661 lines took: 00:00:00.35. Average: 0,001564293 ms/line
MMF - Reading 225662 lines took: 00:00:00.29. Average: 0,001320559 ms/line
The timings might differ from run to run, but the ratio stays the same. You might have noticed the difference of 1 line. Interesting. Opening the file with FAR Manager's editor shows 225662, so I don't know what ReadLines is missing there.
Still, one has to be careful with MMF. If you take this path, you should also read this: http://blogs.msdn.com/b/bclteam/archive/2011/06/06/memory-mapped-file-quirks.aspx
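One possible explanation for the extra line, which the answer itself does not confirm: a view created with CreateViewStream() and no explicit size is rounded up to the system page size, so the stream can expose padding NUL bytes after the real end of the file, and a StreamReader may report that padding as one more line. The sketch below limits the view to the file's actual length. The class name BoundedMappedReader is hypothetical, and this is an illustration under that assumption rather than a verified fix for the numbers shown above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class BoundedMappedReader
{
    // Reads lines through a memory-mapped view that is limited to the
    // file's real length, so page-alignment padding is never exposed.
    public static IEnumerable<string> ReadLines(string fileName)
    {
        long length = new FileInfo(fileName).Length;
        if (length == 0)
            yield break; // an empty file cannot be mapped with a zero capacity

        using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
        using (var view = mmf.CreateViewStream(0, length))
        using (var reader = new StreamReader(view, Encoding.UTF8, true, 4096))
        {
            string line;
            // ReadLine returns null once the bounded view is exhausted.
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```

If this assumption holds for a given file, the line count from this reader should match File.ReadLines; if it does not, the discrepancy has another cause.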
[Update: added memory usage tests]
I have updated the test application like this:
public static void Main(string[] args)
{
    AppDomain.MonitoringIsEnabled = true;
    long lines = 0;
    const string fileName = @"D:\TEMP\setupapi.dev.20140929_185959.log";

    var watch = Stopwatch.StartNew();
    long frl_MU_b = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
    foreach (var s in File.ReadLines(fileName))
    {
        lines++;
    }
    long frl_MU_a = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
    watch.Stop();
    TimeSpan ts = watch.Elapsed;
    string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
    Console.WriteLine("ReadLines - Reading {0} lines took: {1}. Average: {2} ms/line. Memory usage: {3}", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines, frl_MU_a-frl_MU_b);

    lines = 0;
    watch = Stopwatch.StartNew();
    long mmf_MU_b = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
    using(var x = new MemoryMappedTextFileReader(fileName))
    {
        foreach(var s in x.ReadLines())
        {
            lines++;
        }
    }
    long mmf_MU_a = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
    watch.Stop();
    ts = watch.Elapsed;
    elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
    Console.WriteLine("MMF - Reading {0} lines took: {1}. Average: {2} ms/line. Memory usage: {3}", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines, mmf_MU_a-mmf_MU_b);

    Console.Write("Press any key to continue . . . ");
    Console.ReadKey(true);
}
And here are the results:
ReadLines - Reading 225661 lines took: 00:00:00.36. Average: 0,001613039 ms/line. Memory usage: 41667828
MMF - Reading 225662 lines took: 00:00:00.35. Average: 0,001586443 ms/line. Memory usage: 37764368
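As you can see, the MMF approach allocates even less memory: about 37.8 MB versus 41.7 MB of total managed allocations, which is what MonitoringTotalAllocatedMemorySize reports (total bytes allocated, not peak usage).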
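The tests above only count lines. To tie them back to the question, here is a minimal sketch of the filtering step, assuming the keyword always starts at a known zero-based column. The class name FixedPositionFilter and the parameters position and keyword are hypothetical. The ordinal comparison checks only the characters at that column, so it avoids scanning the whole line and avoids allocating a substring per line.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class FixedPositionFilter
{
    // True when 'keyword' occurs in 'line' starting exactly at 'position'.
    public static bool HasKeywordAt(string line, int position, string keyword)
    {
        // The length check keeps CompareOrdinal from throwing on short lines.
        return line.Length >= position + keyword.Length
            && string.CompareOrdinal(line, position, keyword, 0, keyword.Length) == 0;
    }

    // Keeps only the matching lines in memory for further processing.
    public static List<string> FilterLines(IEnumerable<string> lines, int position, string keyword)
    {
        var matches = new List<string>();
        foreach (var line in lines)
        {
            if (HasKeywordAt(line, position, keyword))
                matches.Add(line);
        }
        return matches;
    }
}
```

Because FilterLines takes any IEnumerable<string>, it can be fed from File.ReadLines(fileName) or from the ReadLines() method of MemoryMappedTextFileReader, for example FixedPositionFilter.FilterLines(File.ReadLines(fileName), 11, "ERROR"), where the column and keyword are made-up values.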
Further reading, on the fastest way to read a text file line by line: http://stackoverflow.com/questions/8037070/