Sort very large text file in PowerShell
Question
I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in yyyy-MM-dd hh:mm:ss format, so no preprocessing is necessary for sorting).
The simplest and most obvious thing that comes to mind is:
Get-Content unsorted.txt | sort | get-unique > sorted.txt
I am guessing (without having tried it) that doing this using Get-Content would take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it.
Thanks to anyone who might have a more efficient idea.
I tried this subsequently, and it took a very long time; some 10 minutes for 400 MB.
Accepted answer
Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.
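As a quick illustration of the gap (the path here is hypothetical, matching the file used later): the .NET File.ReadLines API streams lines lazily without the per-line pipeline and object-wrapping overhead that makes Get-Content slow, so a sketch to time raw reading might look like this:

```powershell
# Sketch only: time how long it takes just to enumerate every line.
# [System.IO.File]::ReadLines returns a lazy enumerator, so lines are
# read from disk one at a time instead of being buffered as PSObjects.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$count = 0
foreach ($line in [System.IO.File]::ReadLines("D:\log3.txt")) {
    $count++
}
$sw.Stop()
Write-Output ("Enumerated {0} lines in {1}" -f $count, $sw.Elapsed)
```

This isolates pure read time, which is useful as a floor when judging the timings below.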
Let's set up a baseline:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file containing 1.6 million lines (100k unique lines repeated 16 times), this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates, then copy the data to a List and sort it there, then use a StreamWriter to write the results back.
$hs = New-Object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
while (($line = $reader.ReadLine()) -ne $null)
{
$t = $hs.Add($line)
}
}
finally {
$reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = New-Object System.Collections.Generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
foreach ($s in $ls)
{
$f.WriteLine($s);
}
}
finally
{
$f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
This script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
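A possible variant (a sketch, not something benchmarked here): a SortedSet[string] deduplicates and sorts in a single pass, replacing the HashSet-then-List.Sort pair. Each insert is an O(log n) tree operation rather than an O(1) hash add, so whether it beats the approach above depends on the duplicate ratio; output paths are hypothetical.

```powershell
# Sketch: SortedSet keeps its elements sorted and rejects duplicates,
# so reading the file fills it with unique lines already in order.
$ss = New-Object System.Collections.Generic.SortedSet[string]
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        [void]$ss.Add($line)   # Add returns $false for duplicates; result discarded
    }
}
finally {
    $reader.Close()
}
# WriteAllLines accepts any IEnumerable[string], so the set can be
# written out directly without an intermediate array.
[System.IO.File]::WriteAllLines("D:\result3.txt", $ss)
```

The timing pattern from the scripts above (Stopwatch around each phase) would apply here unchanged if you want to compare the two approaches on your own data.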