Sort very large text file in PowerShell


Problem description

I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in yyyy-MM-dd hh:mm:ss format, so no pre-processing is necessary for sorting).

The simplest and most obvious thing that comes to mind is

 Get-Content unsorted.txt | sort | get-unique > sorted.txt
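
As an aside (not part of the original question): sort here resolves to the Sort-Object cmdlet and get-unique to Get-Unique, and the two steps can be collapsed with Sort-Object's own -Unique switch, although that does nothing for the performance problem:

 Get-Content unsorted.txt | Sort-Object -Unique > sorted.txt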

I am guessing (without having tried it) that doing this with Get-Content would take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using that.

Thanks to anyone who might have a more efficient idea.

I tried this subsequently, and it took a very long time: some 10 minutes for 400 MB.

Recommended answer

Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.
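
Much of Get-Content's overhead comes from emitting the lines one object at a time through the pipeline. A commonly suggested mitigation (not used in the measurements below, and only a partial fix) is the -ReadCount parameter; with -ReadCount 0 all the lines come back as a single array:

$c = Get-Content .\log3.txt -Encoding Ascii -ReadCount 0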

Let's set a baseline:

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

With a 40 MB file of 1.6 million lines (100k unique lines repeated 16 times), this script produces the following output on my machine:

Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663

Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which will remove the duplicates, then copy the data into a List and sort it there, and finally use a StreamWriter to dump the results back out.

$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $t = $hs.Add($line)   # Add() returns a bool (false for duplicates); assigned only to discard it
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
    $f = New-Object System.IO.StreamWriter "d:\result2.txt";
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

This script produces:

read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802

On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
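
A further variation that might be worth measuring (a sketch only, not benchmarked here; result3.txt is just an assumed output name): a SortedSet keeps the lines de-duplicated and ordered as they are inserted, so the separate List.Sort() step disappears, and File.WriteAllLines can replace the explicit StreamWriter loop. Whether the tree-based inserts actually beat HashSet-plus-Sort would need testing on the real data.

$ss = New-Object System.Collections.Generic.SortedSet[string]
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        # Add() returns $false for duplicates; the set stays sorted as it grows
        [void]$ss.Add($line)
    }
}
finally {
    $reader.Close()
}
# d:\result3.txt is a hypothetical output path; ASCII encoding to match the original output
[System.IO.File]::WriteAllLines("d:\result3.txt", $ss, [System.Text.Encoding]::ASCII)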
