Convert a 4 GB fixed column width text file with no delimiters and 100+ columns to a trimmed, tab-delimited file

Question

Every month I receive several very large (~4 GB) fixed column width text files that need to be imported into MS SQL Server. To import a file, it must first be converted into a text file with tab-delimited column values, with spaces trimmed from each column value (some columns have no spaces). I'd like to use PowerShell to solve this and I'd like the code to be very, very fast.

I've tried many iterations of code, but so far they are either too slow or don't work. I've tried the Microsoft Text Parser (too slow). I've tried regex matching. I'm working on a Windows 7 machine with PowerShell 5.1 installed.

 ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
 10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 

$infile = "C:\Testing\IN_AND_OUT_FILES\srctst.txt"
$outfile = "C:\Testing\IN_AND_OUT_FILES\outtst.txt"

$batch = 1

[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'
[regex]$replace_regex = "`${1}`t`${2}`t`${3}`t`${4}`t`${5}`t`${6}`t`${7}`t`${8}`t`${9}`t`${10}`t`${11}`t`${12}`t`${13}`t`${14}`t`${15}`t`${16}`t`${17}`t`${18}`t`${19}`t`${20}`t`${21}`t`${22}`t`${23}`t`${24}`t`${25}`t`${26}`t`${27}`t`${28}`t`${29}`t`${30}`t`${31}`t`${32}`t`${33}"

Get-Content $infile -ReadCount $batch |
    foreach {
        $_ -replace $match_regex, $replace_regex | Out-File $outfile -Append
    }

Any help you can give is appreciated!

Recommended answer

The switch statement with the -File option is the fastest way to process large files in PowerShell [1]:

& { 
  switch -File $infile -Regex  {
    $match_regex {
       # Join what all the capture groups matched, trimmed, with a tab char.
       $Matches[1..($Matches.Count-1)].Trim() -join "`t"
    }
  }
} | Out-File $outFile # or: Set-Content $outFile (beware encoding issues)

With text output, Out-File and Set-Content can be used interchangeably, but note that in Windows PowerShell they use different character encodings by default (UTF-16LE vs. ANSI); use -Encoding as needed. PowerShell Core uses BOM-less UTF-8 consistently.
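As a minimal end-to-end sketch of the approach above, the following builds the matching regex from a list of column widths instead of typing it by hand, runs the switch -File loop against a tiny sample file, and states the output encoding explicitly. The three column widths, the sample rows, and the temp-file names are assumptions made up for illustration; they are not the asker's real layout.

```powershell
# Hypothetical layout: three fixed-width columns of 11, 24 and 19 characters.
$widths = 11, 24, 19

# Build '^(.{11})(.{24})(.{19})$' from the width list.
$matchRegex = '^' + (-join ($widths | ForEach-Object { "(.{$($_)})" })) + '$'

# Create a small sample input file (made-up rows, padded to the widths above).
$inFile  = Join-Path ([System.IO.Path]::GetTempPath()) 'fixed-width-demo.txt'
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'tab-delimited-demo.txt'
$rows = @(
  ('{0,-11}{1,-24}{2,-19}' -f '10000000001', 'MINNIE', 'MOUSE'),
  ('{0,-11}{1,-24}{2,-19}' -f '10000000002', 'MICKEY', 'MOUSE')
)
Set-Content -LiteralPath $inFile -Value $rows -Encoding Ascii

# Stream the file line by line; trim every capture group and join with tabs.
# -Encoding is given explicitly so the result does not depend on the
# edition-specific default.
& {
  switch -Regex -File $inFile {
    $matchRegex { $Matches[1..($Matches.Count - 1)].Trim() -join "`t" }
  }
} | Set-Content -LiteralPath $outFile -Encoding UTF8

$result = @(Get-Content -LiteralPath $outFile -Encoding UTF8)
```

Deriving the regex from a width list keeps the column layout in one place, which may be easier to maintain for 100+ columns than a hand-written pattern.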

Note:

  • To skip the header row or capture it separately, either provide a separate regex for it, or, if the header also matches the data-row regex, initialize a line index variable before the switch statement (e.g., $i = 0) and check and increment that variable in the processing script block (e.g., if ($i++ -eq 0) { ... }).

  • .Trim() is implicitly applied to each of the strings in the array returned by $Matches[1..($Matches.Count-1)]; this feature is called member enumeration.

  • The reason that the switch statement is enclosed in & { ... } (a script block, { ... }, invoked with the call operator, &) is that compound statements such as switch, while, and foreach (...) aren't directly supported as pipeline input - see this GitHub issue.
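The header-skipping note can be sketched as follows, assuming the header line has the same width as the data rows and therefore matches the same regex. The two-column layout, the sample rows, and the temp-file name are hypothetical.

```powershell
# Hypothetical two-column layout: 5 + 6 characters per line.
$inFile = Join-Path ([System.IO.Path]::GetTempPath()) 'header-demo.txt'
Set-Content -LiteralPath $inFile -Encoding Ascii -Value @(
  'ID   NAME  ',
  '001  MINNIE',
  '002  MICKEY'
)

$rowRegex = '^(.{5})(.{6})$'
$i = 0   # line index, initialized before the switch statement

$dataRows = @(
  & {
    switch -Regex -File $inFile {
      $rowRegex {
        # The first matching line is the header: skip to the next input line.
        if ($i++ -eq 0) { continue }
        $Matches[1..($Matches.Count - 1)].Trim() -join "`t"
      }
    }
  }
)
```

Inside a switch statement, continue moves on to the next input line, so only the data rows are emitted.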

As for what you tried:

As iRon points out, you shouldn't use $Input as a user variable - it is an automatic variable managed by PowerShell, and, in fact, whatever you assign to it is quietly discarded.

As AdminOfThings points out:

  • $element = $_.trim() doesn't work, because you're inside a foreach loop, not in a pipeline with the ForEach-Object cmdlet (even though the latter is also aliased to foreach); only with ForEach-Object would $_ be set to the current input object.
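A small sketch of the difference, using made-up sample values: in the foreach statement the current item is in the loop variable you name, while $_ is only populated by ForEach-Object in a pipeline.

```powershell
$values = ' a ', ' b '

# foreach statement: the current item lives in the loop variable ($value here).
$fromLoop = foreach ($value in $values) { $value.Trim() }

# ForEach-Object cmdlet in a pipeline: the current item is available as $_.
$fromPipeline = $values | ForEach-Object { $_.Trim() }
```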

There is no need for a custom function just for joining the elements of an array with a separator; the -join operator does that directly, as shown above.

Lee_Daily shows how to use -join directly with the $Matches array, as used above.
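To make the $Matches handling concrete, here is the same expression broken into steps for a single line. The sample line and the two-column regex are hypothetical.

```powershell
$line = '001  MINNIE'

if ($line -match '^(.{5})(.{6})$') {
  # $Matches[0] holds the whole match; keys 1..n hold the capture groups.
  $groups = $Matches[1..($Matches.Count - 1)]

  # Member enumeration: .Trim() is called on every string in the array.
  $trimmed = $groups.Trim()

  # -join concatenates the elements with the given separator.
  $joined = $trimmed -join "`t"
}
```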

Some asides:

Join-Str($matches)

You should use Join-Str $matches instead:

In PowerShell, functions are invoked like shell commands - foo arg1 arg2 - not like C# methods - foo(arg1, arg2); see Get-Help about_Parsing.
If you use , to separate arguments, you'll construct an array that a function sees as a single argument.
To prevent accidental use of method syntax, use Set-StrictMode -Version 2 or higher, but note its other effects.
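The effect is easy to see with a throwaway function that only reports how many arguments it received. Get-ArgCount is a hypothetical helper written for this illustration, not a built-in command.

```powershell
# Hypothetical helper: returns the number of positional arguments it was given.
function Get-ArgCount { $args.Count }

# Shell-style invocation: two separate arguments.
$shellStyle = Get-ArgCount 'a' 'b'      # 2

# Method-style invocation: ('a', 'b') is one array, passed as a single argument.
$methodStyle = Get-ArgCount('a', 'b')   # 1
```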

| Out-Null

An almost always faster method of output suppression is to use $null = ... instead.
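Both forms side by side, using ArrayList.Add() as a stand-in for any call whose return value you want to discard (it returns the index of the added item):

```powershell
$list = [System.Collections.ArrayList]::new()

# Pipeline-based suppression: the return value is sent to Out-Null.
$list.Add('a') | Out-Null

# Assignment-based suppression: the return value is assigned to $null and discarded.
$null = $list.Add('b')
```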

[1] Mark (the OP) reports a dramatic speed-up compared to the Get-Content + ForEach-Object approach in the question (the switch solution takes 7.7 minutes for a 4 GB file).
While a switch solution is likely fast enough in most scenarios, this answer shows a solution that may be faster for high iteration counts; this answer contrasts it with a switch solution and shows benchmarks with varying iteration counts.
Beyond that, a compiled solution written in, say, C# is the only way to further improve performance.
