More-efficient way to modify a CSV file's content


Problem description



I'm attempting to remove some of the detritus that SSMS 2012 generates when a query's results are exported as CSV.

For example, it includes the word 'NULL' for null values and adds milliseconds to datetime values:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456
NULL,NULL,NULL,0

Unfortunately, Excel doesn't automatically format datetime values with fractional seconds correctly, which leads to confusion among the customers ('What happened to the date field that I requested?') and more work for me (having to convert the CSV to XLSX and format the columns correctly prior to distribution).

The goal is to strip the CSV file of NULL and .000 values:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0

Excel will open this file and format it properly without further technical assistance.

To that end, I wrote:

Function Invoke-CsvCleanser {

  [CmdletBinding()]
  Param(
    [parameter(Mandatory=$true)]
    [String]
    $Path,
    [switch]
    $Nulls,
    [switch]
    $Milliseconds
  )

  PROCESS {

    # open the file
    $data = Import-Csv $path

    # process each row
    $data | Foreach-Object { 

        # process each column
        Foreach ($property in $_.PSObject.Properties) {

            # if column contains 'NULL', replace it with ''
            if ($Nulls -and ($property.Value -eq 'NULL')) {
                $property.Value = $property.Value -replace 'NULL', ''
            }

            # if column contains a date/time value, remove trailing milliseconds
            elseif ( $Milliseconds -and (IsDate $property.Value) ) {
                # the dot is escaped: an unescaped '.' would match any character
                $property.Value = $property.Value -replace '\.000$', ''
            }
        } 

    } 

    # save file
    $data | Export-Csv -Path $Path -NoTypeInformation

  }

}

function IsDate($object) {
    [Boolean]($object -as [DateTime])
}

PS> Invoke-CsvCleanser 'C:\Users\Foobar\Desktop\0000.csv' -Nulls -Milliseconds

This works fine when the file size is small, but is quite inefficient for large files. Ideally, Invoke-CsvCleanser would make use of the pipeline.

Is there a better way to do this?

Solution

As used in the question, the output of Import-Csv is assigned to a variable, so the entire file is parsed into objects and held in memory, which is slow for large files. Here is a modified script from my answer to this question: CSV formatting - strip qualifier from specific fields.

It processes the file as raw text, line by line, so it should be significantly faster. NULLs and milliseconds are matched and replaced using regexes. The script can also batch-convert every CSV in a folder.

The regex that splits each CSV line comes from this question: How to split a string by comma ignoring comma in double quotes

Save this script as Invoke-CsvCleanser.ps1. It accepts the following arguments:
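To illustrate what that regex does, here is a small sketch using the same pattern as the script below. The sample line is made up for the illustration: the comma inside the quoted field is not treated as a separator, because the lookahead only accepts commas that are followed by balanced pairs of double quotes.

```powershell
$DQuotes = '"'
$Separator = ','
# Same pattern as in the script: split on a separator only when the rest of
# the line contains complete "..." pairs (i.e. we are not inside a quoted field).
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

# Hypothetical sample line with a comma inside a quoted field
$line = '2015-05-01,"LOREM, IPSUM",NULL,0'

$fields = $line -split $SplitRegex
$fields.Count   # number of fields found
$fields         # one field per line; the quoted field stays in one piece
```

Note that the surrounding quotes are kept as part of the field; stripping them is a separate step.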

  • InPath: folder to read CSVs from. If not specified, the current directory is used.
  • OutPath: folder to save processed CSVs to. It will be created if it doesn't exist.
  • Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list other valid encodings for your system in the PowerShell console like this: [System.Text.Encoding]::GetEncodings()
  • DoubleQuotes: switch; if specified, surrounding double quotes are stripped from values
  • Nulls: switch; if specified, values consisting only of the string NULL are replaced with empty values
  • Milliseconds: switch; if specified, a three-digit milliseconds suffix (such as .000) is stripped from time values
  • Verbose: the script tells you what's going on via Write-Verbose messages.
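For the Encoding argument, a short sketch of how to look up a name and check that it resolves. The name 'utf-8' is only an example; the list returned by GetEncodings() depends on the platform and .NET runtime.

```powershell
# List the encodings .NET reports on this system
[System.Text.Encoding]::GetEncodings() |
    Sort-Object -Property Name |
    Select-Object -Property Name, CodePage, DisplayName

# Resolve an encoding by name; this is the same call the script makes
# with the value passed to -Encoding. An unknown name throws.
$FileEncoding = [System.Text.Encoding]::GetEncoding('utf-8')
$FileEncoding.WebName
```

Any value from the Name column should be usable as -Encoding, for example -Encoding 'utf-8'.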

Example:

Process all CSVs in the folder C:\CSVs_are_here, strip NULLs and milliseconds, save processed CSVs to the folder C:\Processed_CSVs, be verbose:

.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Nulls -Milliseconds -Verbose

Invoke-CsvCleanser.ps1 script:

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [switch]$Nulls,

    [switch]$Milliseconds,

    [switch]$DoubleQuotes
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Regex to match NULL
$NullRegex = '^NULL$'
# Regex to match milliseconds: 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                        ForEach-Object {

                            # Strip surrounding quotes
                            if($DoubleQuotes)
                            {
                                $_ = $_.Trim($DQuotes)
                            }

                            # Strip NULL strings
                            if($Nulls)
                            {
                                $_ = $_ -replace $NullRegex, ''
                            }

                            # Strip milliseconds
                            if($Milliseconds)
                            {
                                $_ = $_ -replace $MillisecondsRegex, '$1'
                            }

                            # Output current object to pipeline
                            $_
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }

Result:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
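The per-field logic can be tried without touching the file system. The following is a minimal sketch, assuming the question's sample rows held in an in-memory array instead of a StreamReader; it uses the same three regexes as the script and applies the Nulls and Milliseconds replacements to each field.

```powershell
$Separator = ','
$DQuotes = '"'
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
$NullRegex = '^NULL$'
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

# Sample rows from the question, held in memory for the illustration
$lines = @(
    'DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN'
    '2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456'
    'NULL,NULL,NULL,0'
)

$cleaned = foreach ($line in $lines) {
    # Split into fields, clean each field, then re-join into one line
    $fields = $line -split $SplitRegex | ForEach-Object {
        ($_ -replace $NullRegex, '') -replace $MillisecondsRegex, '$1'
    }
    $fields -join $Separator
}

$cleaned
```

Because $NullRegex is anchored with ^ and $, only fields that consist entirely of NULL are emptied; a text value that merely contains the word is left alone.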

It would be interesting to see if one could pass lambdas as a way to make the file processing more flexible. Each lambda would perform a specific activity (removing NULLs, upper-casing, normalizing text, etc.)

This version gives full control over CSV processing. Just pass one or more scriptblocks to the Action parameter, in the order you want them to execute.
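As a side note, the same idea can be sketched with scriptblocks that return the new value instead of assigning to $_. This is a variant, not what the script below does: Invoke-FieldAction is a hypothetical helper name, and the design choice is to avoid relying on a dot-sourced scriptblock sharing $_ with its caller.

```powershell
# Variant: each scriptblock returns the transformed field
$Action = { $_ -replace '^NULL$', '' },
          { $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1' },
          { $_.Trim('"') }

# Hypothetical helper: runs the scriptblocks in order on a single field
function Invoke-FieldAction {
    param(
        [string]$Field,
        [scriptblock[]]$Action
    )
    foreach ($scriptblock in $Action) {
        # ForEach-Object binds the piped value to $_ inside the scriptblock
        $Field = $Field | ForEach-Object -Process $scriptblock
    }
    $Field
}

$cleaned = 'NULL', '"2015-05-01 23:00:00.000"', 'LOREM IPSUM' |
    ForEach-Object { Invoke-FieldAction -Field $_ -Action $Action }

$cleaned
```

With this shape, a scriptblock that forgets to return a value empties the field, so each action has to end with the expression it wants to keep.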

Example: strip NULLs, strip milliseconds and then strip double quotes.

.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Action {$_ = $_ -replace '^NULL$', '' }, {$_ = $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1'}, {$_ = $_.Trim('"')}

Invoke-CsvCleanser.ps1 with "lambdas":

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]    
    [scriptblock[]]$Action
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp =  $line -split $SplitRegex |
                        ForEach-Object {
                            # Process each item
                            foreach($scriptblock in $Action) {
                                . $scriptblock
                            }
                            # Output current object to pipeline
                            $_
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }
