How to sort 30 million CSV records in PowerShell

Question

I am using an OleDbConnection to sort the first column of a CSV file. The OLEDB query completes successfully for up to 9 million records in about 6 minutes, but when I execute it against 10 million records I get the following alert message:

Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result."

Is there any other solution to sort 30 million records using PowerShell?

Here is my script:

$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )

$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()

# Let the ACE OLEDB text provider do the sorting via ORDER BY
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd = $conn.CreateCommand()
$cmd.CommandText = "Select * from 1crores.csv order by col6"

$conn.Open()

$data = $cmd.ExecuteReader()

echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6" )

# Stream the sorted rows back out to the output file
while ($data.Read())
{
    $stream.WriteLine( $data.GetValue(0) + ',' + $data.GetValue(1) + ',' + $data.GetValue(2) + ',' + $data.GetValue(3) + ',' + $data.GetValue(4) + ',' + $data.GetValue(5) )
}
echo "data written successfully!!!"

$data.Close()
$stream.Close()
$sw.Stop()
$sw.Elapsed

$cmd.Dispose()
$conn.Dispose()

Answer

I have added a new answer as this is a completely different approach to tackling the issue.
Instead of creating temporary files (which presumably causes a lot of file opens and closures), you might consider creating an ordered list of indices and then going over the input file (-InputFile) multiple times, each time processing a selective number of lines (-BufferSize = 1Gb; you might have to tweak this "memory usage vs. performance" parameter):

Function Sort-Csv {
    [CmdletBinding()] param(
        [string]$InputFile,
        [String]$Property,
        [string]$OutputFile,
        [Char]$Delimiter = ',',
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default,
        [Int]$BufferSize = 1Gb
    )
    Begin {
        if ($InputFile.StartsWith('.\')) { $InputFile = Join-Path (Get-Location) $InputFile }
        # Build a sorted dictionary: property value -> list of zero-based record indices.
        # Flattening its values gives the record order of the sorted output.
        $Index = 0
        $Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
        Import-Csv $InputFile -Delimiter $Delimiter -Encoding $Encoding | Foreach-Object { 
            if (!$Dictionary.ContainsKey($_.$Property)) { $Dictionary[$_.$Property] = [Collections.Generic.List[Int]]::new() }
            $Dictionary[$_.$Property].Add($Index++)
        }
        $Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
        $Dictionary = $Null                                     # we only need the sorted index list
    }
    Process {
        $Start = 0
        # Estimate how many records fit in the buffer per pass (2.2 is a rough overhead factor)
        $ChunkSize = [int]($BufferSize / (Get-Item $InputFile).Length * $Indices.Count / 2.2)
        While ($Start -lt $Indices.Count) {
            [System.GC]::Collect()
            $End = $Start + $ChunkSize - 1
            if ($End -ge $Indices.Count) { $End = $Indices.Count - 1 }
            # Mark the record indices needed for this chunk (values are replaced by the lines below)
            $Chunk = @{}
            For ($i = $Start; $i -le $End; $i++) { $Chunk[$Indices[$i]] = $i }
            # Re-read the input file and capture only the lines that belong to this chunk,
            # stopping early once the whole chunk has been collected
            $Reader = [System.IO.StreamReader]::new($InputFile, $Encoding)
            $Header = $Reader.ReadLine()
            $Count = 0
            For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) {
                if ($Chunk.Contains($i)) { $Chunk[$i] = $Line; $Count++ }
            }
            $Reader.Dispose()
            if ($OutputFile) {
                # Append this chunk to the output file; write the header only on the first pass
                if ($OutputFile.StartsWith('.\')) { $OutputFile = Join-Path (Get-Location) $OutputFile }
                $Writer = [System.IO.StreamWriter]::new($OutputFile, ($Start -ne 0), $Encoding)
                if ($Start -eq 0) { $Writer.WriteLine($Header) }
                For ($i = $Start; $i -le $End; $i++) { $Writer.WriteLine($Chunk[$Indices[$i]]) }
                $Writer.Dispose()
            } else {
                # No output file: emit the sorted records as objects down the pipeline
                $Start..$End | ForEach-Object { $Header } { $Chunk[$Indices[$_]] } | ConvertFrom-Csv -Delimiter $Delimiter
            }
            $Chunk = $Null
            $Start = $End + 1
        }
    }
}
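
To get a feel for how many passes this makes, the chunk size is roughly BufferSize / fileLength * rowCount / 2.2, as in the Process block above. The sketch below is only a back-of-the-envelope illustration: the 2 GB file size and the 30 million row count are assumed figures, not measurements from the question.

# Hypothetical figures: a 2 GB input file with 30 million records and the default 1Gb buffer
$BufferSize = 1Gb
$FileLength = 2Gb            # assumed size of the 30-million-record CSV
$RowCount   = 30000000
$ChunkSize  = [int]($BufferSize / $FileLength * $RowCount / 2.2)   # ~6.8 million rows per pass
$Passes     = [math]::Ceiling($RowCount / $ChunkSize)              # ~5 passes over the input file
"$ChunkSize rows per pass, $Passes passes"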

Basic usage

Sort-Csv .\Input.csv <PropertyName> -Output .\Output.csv
Sort-Csv .\Input.csv <PropertyName> | ... | Export-Csv .\Output.csv

Note that for 1crores.csv it will probably just export the full file in a single pass, unless you set -BufferSize to a lower amount, e.g. 500Kb.
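
As a concrete call, a run against the file from the question could look like the sketch below. The folder path, the 1crores.csv file name, and the col6 sort column are taken from the question; the output file name and the smaller -BufferSize are only illustrative.

# Minimal usage sketch, assuming the file layout from the question
Sort-Csv -InputFile 'D:\Performance_test_data\1crores.csv' -Property 'col6' `
         -OutputFile 'D:\Performance_test_data\output_sorted.csv' -BufferSize 500Mb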
