In Powershell, what's the most efficient way to split a large text file by record type?


Question

I am using Powershell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line.

If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I need to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET streamreader to read the compressed input files, and I'm wondering if I need to use a streamwriter to write the output files as well.

The naive version looks something like this:

while (!$reader.EndOfStream) {
  $line = $reader.ReadLine();
  switch ($line.Substring(0,3)) {
    "001" {Add-Content "output001.txt" $line}
    "002" {Add-Content "output002.txt" $line}
    "003" {Add-Content "output003.txt" $line}
    }
  }

That just looks like bad news: finding, opening, writing and closing a file once per row. The input files are huge 500MB+ monsters.

Is there an idiomatic way to handle this efficiently w/ Powershell constructs, or should I turn to the .NET streamwriter?

Are there methods of a (New-Item "path" -type "file") object I could use for this?

Edit for context:

I'm using the DotNetZip library to read ZIP files as streams; thus streamreader rather than Get-Content/gc. Sample code:

[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll") 
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")

foreach ($entry in $zipfile) {
  $reader = new-object system.io.streamreader $entry.OpenReader();
  while (!$reader.EndOfStream) {
    $line = $reader.ReadLine();
    #do something here
  }
}

I should probably Dispose() of both the $zipfile and $reader, but that is for another question!
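Since the question raises it in passing, here is one way that cleanup could look: a minimal try/finally sketch around the same sample code (the paths are the same placeholders as above, and it assumes DotNetZip's ZipFile implements IDisposable).

```powershell
# Sketch: deterministic cleanup with try/finally (paths are placeholders).
[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll")
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")
try {
  foreach ($entry in $zipfile) {
    $reader = New-Object System.IO.StreamReader ($entry.OpenReader())
    try {
      while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        # do something here
      }
    }
    finally {
      $reader.Dispose()   # also closes the underlying entry stream
    }
  }
}
finally {
  $zipfile.Dispose()
}
```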

Answer

Reading

For reading and parsing the file, I would use the switch statement:

switch -file c:\temp\stackoverflow.testfile2.txt -regex {
  "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
  "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
  "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
}

I think this is a better approach, because:

  • there is support for regex, so you don't have to take a substring (which might be expensive), and
  • the -file parameter is quite handy ;)

As for writing the output, I'd test using a StreamWriter; however, if the performance of Add-Content is decent for you, I would stick with it.

Added: Keith proposed using the >> operator; however, it seems to be very slow. Besides that, it writes the output in Unicode, which doubles the file size.

Look at my test:

[1]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c >> c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c >> c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c >> c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
159,1585874
[2]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
9,2696923

The difference is huge.

Just for comparison:

[3]: (measure-command {
>>     $reader = new-object io.streamreader c:\temp\stackoverflow.testfile2.txt
>>     while (!$reader.EndOfStream) {
>>         $line = $reader.ReadLine();
>>         switch ($line.substring(0,3)) {
>>             "001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $line}
>>             "002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $line}
>>             "003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $line}
>>             }
>>         }
>>     $reader.close()
>> }).TotalSeconds
>>
8,2454369
[4]: (measure-command {
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
>>         "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
>>         "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
>>     }
>> }).TotalSeconds
8,6755565

Added: I was curious about the write performance ... and I was a little bit surprised:

[8]: (measure-command {
>>     $sw1 = new-object io.streamwriter c:\temp\stackoverflow.testfile.001.txt3b
>>     $sw2 = new-object io.streamwriter c:\temp\stackoverflow.testfile.002.txt3b
>>     $sw3 = new-object io.streamwriter c:\temp\stackoverflow.testfile.003.txt3b
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {$sw1.WriteLine($_)}
>>         "^002" {$sw2.WriteLine($_)}
>>         "^003" {$sw3.WriteLine($_)}
>>     }
>>     $sw1.Close()
>>     $sw2.Close()
>>     $sw3.Close()
>>
>> }).TotalSeconds
>>
0,1062315

It is 80 times faster. Now you have to decide: if speed is important, use StreamWriter; if code clarity is important, use Add-Content.
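If the three hard-coded writers feel brittle, the StreamWriter approach can be generalized: a sketch that creates one writer per record type lazily in a hashtable (the input path and output file names are placeholders, not from the original).

```powershell
# Sketch: one StreamWriter per record type, created lazily in a hashtable.
$writers = @{}
try {
  switch -file c:\temp\stackoverflow.testfile2.txt -regex {
    "^(00[123])" {
      $key = $matches[1]
      if (-not $writers.ContainsKey($key)) {
        $writers[$key] = New-Object System.IO.StreamWriter "c:\temp\output$key.txt"
      }
      $writers[$key].WriteLine($_)
    }
  }
}
finally {
  # Close every writer even if the switch throws.
  foreach ($w in $writers.Values) { $w.Close() }
}
```

This keeps the single-pass, single-handle-per-file performance of case [8] while scaling to any number of record types without editing the switch body.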

According to Keith, Substring is 20% faster. As always, it depends; in my case, however, the results were like this:

[102]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.s.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.s.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.s.txt}}}
>> }).TotalSeconds
>>
9,0654496
[103]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch -regex ($_) {
>>             '^001'{$c | Add-content c:\temp\stackoverflow.testfile.001.r.txt} `
>>             '^002'{$c | Add-content c:\temp\stackoverflow.testfile.002.r.txt} `
>>             '^003'{$c | Add-content c:\temp\stackoverflow.testfile.003.r.txt}}}
>> }).TotalSeconds
>>
9,2563681

So the difference is not significant, and for me regexes are more readable.
