More-efficient way to modify a CSV file's content
Problem description
I'm attempting to remove some of the detritus that SSMS 2012 generates when a query's results are exported as CSV.
For example, it includes the word 'NULL' for null values and adds milliseconds to datetime values:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456
NULL,NULL,NULL,0
Unfortunately, Excel doesn't automatically format datetime values with fractional seconds correctly, which leads to confusion amongst the customers ('What happened to the date field that I requested?') and more work for me (having to convert the CSV to XLSX and format the columns correctly prior to distribution).
The goal is to strip the CSV file of NULL and .000 values:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
Excel will open this file and format it properly without further technical assistance.
To that end, I wrote:
Function Invoke-CsvCleanser {
    [CmdletBinding()]
    Param(
        [parameter(Mandatory=$true)]
        [String]
        $Path,
        [switch]
        $Nulls,
        [switch]
        $Milliseconds
    )
    PROCESS {
        # open the file
        $data = Import-Csv $path
        # process each row
        $data | Foreach-Object {
            # process each column
            Foreach ($property in $_.PSObject.Properties) {
                # if column contains 'NULL', replace it with ''
                if ($Nulls -and ($property.Value -eq 'NULL')) {
                    $property.Value = $property.Value -replace 'NULL', ''
                }
                # if column contains a date/time value, remove milliseconds
                elseif ( $Milliseconds -and (isDate($property.Value)) ) {
                    $property.Value = $property.Value -replace '.000', ''
                }
            }
        }
        # save file
        $data | Export-Csv -Path $Path -NoTypeInformation
    }
}
function IsDate($object) {
    [Boolean]($object -as [DateTime])
}
PS> Invoke-CsvCleanser 'C:\Users\Foobar\Desktop\0000.csv' -Nulls -Milliseconds
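This works fine when the file size is small, but is quite inefficient for large files. Ideally, Invoke-CsvCleanser would make use of the pipeline.
Is there a better way to do this?

Import-Csv always loads the entire file into memory, so it's slow. Here is a modified script from my answer to this question: CSV formatting - strip qualifier from specific fields.
It uses raw file processing, so it should be significantly faster. NULLs and milliseconds are matched and replaced using regex. The script is able to mass-convert CSVs.
The regex to split CSV is from this question: How to split a string by comma ignoring comma in double quotes
Save this script as Invoke-CsvCleanser.ps1. It accepts the following arguments: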
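InPath: folder with the CSV files to process. Defaults to the current directory.
OutPath: folder for the processed CSV files. It is created if it doesn't exist.
Encoding: name of the file encoding. Defaults to the system default encoding; valid names can be listed with [System.Text.Encoding]::GetEncodings()
Nulls: NULL strings will be stripped from values
Milliseconds: .000 strings will be stripped from values
DoubleQuotes: surrounding double quotes will be stripped from values
Verbose: the script tells you what's going on via Write-Verbose messages.
Example: Process all CSVs in the folder C:\CSVs_are_here, strip NULLs and milliseconds, save processed CSVs to the folder C:\Processed_CSVs, be verbose:
.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Nulls -Milliseconds -Verbose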
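The regex-based field handling described above can be tried in isolation. The following is a minimal sketch that uses the same three patterns as the script; the sample line is made up for illustration and includes a quoted field with an embedded comma to show that the split leaves it intact.

```powershell
# Same patterns as in the script; the sample line below is hypothetical.
$DQuotes   = '"'
$Separator = ','
# Split only on separators that are not inside a double-quoted field.
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# A field that consists of the word NULL and nothing else.
$NullRegex = '^NULL$'
# A time followed by three fractional digits, e.g. 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

$line = 'NULL,2015-05-01 23:00:00.000,"LOREM, IPSUM",10.3456'

# Split the line into fields, clean each field, then join the fields again.
$fields = $line -split $SplitRegex | ForEach-Object {
    $_ -replace $NullRegex, '' -replace $MillisecondsRegex, '$1'
}
$result = $fields -join $Separator
$result
```

For the sample line, the NULL field becomes empty, the milliseconds are dropped, and the comma inside "LOREM, IPSUM" is not treated as a separator.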
Invoke-CsvCleanser.ps1 script:
Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [switch]$Nulls,

    [switch]$Milliseconds,

    [switch]$DoubleQuotes
)

if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Regex to match NULL
$NullRegex = '^NULL$'
# Regex to match milliseconds: 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                ForEach-Object {
                    # Strip surrounding quotes
                    if($DoubleQuotes)
                    {
                        $_ = $_.Trim($DQuotes)
                    }
                    # Strip NULL strings
                    if($Nulls)
                    {
                        $_ = $_ -replace $NullRegex, ''
                    }
                    # Strip milliseconds
                    if($Milliseconds)
                    {
                        $_ = $_ -replace $MillisecondsRegex, '$1'
                    }
                    # Output current object to pipeline
                    $_
                }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()
        $InFile.Close()
        $InFile.Dispose()
    }
Result:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
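For comparison, here is a sketch of the pipeline form the question asks about. It assumes that rows flow from Import-Csv to Export-Csv one at a time when the result is not first collected into a variable, so the whole file need not be held as an array of objects. The function name Invoke-CsvCleanserStream and the separate -OutPath parameter are hypothetical; output goes to a different file because the input is still being read while the output is written.

```powershell
# Hypothetical pipeline variant of the question's function (not the answer's script).
function Invoke-CsvCleanserStream {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [Parameter(Mandatory = $true)][string]$OutPath,
        [switch]$Nulls,
        [switch]$Milliseconds
    )
    # Rows are passed down the pipeline instead of being stored in $data first.
    Import-Csv -LiteralPath $Path | ForEach-Object {
        foreach ($property in $_.PSObject.Properties) {
            $value = $property.Value
            if ($Nulls -and $value -eq 'NULL') {
                # Replace the literal word NULL with an empty value
                $value = ''
            }
            elseif ($Milliseconds) {
                # Drop three fractional digits after a time, e.g. 23:00:00.000
                $value = $value -replace '(\d{2}:\d{2}:\d{2})\.\d{3}', '$1'
            }
            $property.Value = $value
        }
        # Pass the modified row on to Export-Csv
        $_
    } | Export-Csv -LiteralPath $OutPath -NoTypeInformation
}
```

Usage would look like Invoke-CsvCleanserStream -Path 'C:\in.csv' -OutPath 'C:\out.csv' -Nulls -Milliseconds. This still creates one object per row, so the raw StreamReader approach above is likely to remain faster on very large files.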
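"It would be interesting to see if one could pass lambdas as a way to make the file processing more flexible. Each lambda would perform a specific activity (removing NULLs, upper-casing, normalizing text, etc.)"
This version gives full control over CSV processing. Just pass one or more scriptblocks to the Action parameter in the order you want them to execute.
Example: strip NULLs, strip milliseconds and then strip double quotes.
.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Action {$_ = $_ -replace '^NULL$', '' }, {$_ = $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1'}, {$_ = $_.Trim('"')}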
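The idea of chaining scriptblocks over a single field can be sketched on its own. Note that this is a variation, not the mechanism the script uses: here each scriptblock returns the new value instead of assigning to $_, and the helper name Invoke-FieldAction is hypothetical.

```powershell
# Hypothetical helper: applies each scriptblock to the value, in order.
# Each scriptblock receives the current value as $_ and outputs the new value.
function Invoke-FieldAction {
    param(
        [string]$Value,
        [scriptblock[]]$Action
    )
    foreach ($scriptblock in $Action) {
        $Value = $Value | ForEach-Object -Process $scriptblock
    }
    $Value
}

# Order matters: quotes are trimmed before the text is upper-cased.
$actions = { $_ -replace '^NULL$', '' },
           { $_.Trim('"') },
           { $_.ToUpper() }

$cleaned = Invoke-FieldAction -Value '"lorem"' -Action $actions
$emptied = Invoke-FieldAction -Value 'NULL' -Action $actions
$cleaned
```

With these inputs, the quoted value comes back unquoted and upper-cased, and the NULL value comes back as an empty string.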
Invoke-CsvCleanser.ps1 with "lambdas":
Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [scriptblock[]]$Action
)

if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                ForEach-Object {
                    # Process each item
                    foreach($scriptblock in $Action) {
                        . $scriptblock
                    }
                    # Output current object to pipeline
                    $_
                }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()
        $InFile.Close()
        $InFile.Dispose()
    }
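Both scripts close their streams at the end of each file, so an error part-way through a file could leave the handles open. One possible hardening, shown here as a sketch rather than as part of the original answer, is to wrap the read/write loop in try/finally. The function name Convert-CsvLines and its parameters are hypothetical, and full paths are assumed because the .NET stream classes resolve relative paths against the process working directory rather than the PowerShell location.

```powershell
# Hypothetical helper: copies a file line by line, applying $LineAction to each line,
# and disposes both streams even if processing fails.
function Convert-CsvLines {
    param(
        [Parameter(Mandatory = $true)][string]$InFilePath,   # full path assumed
        [Parameter(Mandatory = $true)][string]$OutFilePath,  # full path assumed
        [System.Text.Encoding]$FileEncoding = [System.Text.Encoding]::UTF8,
        [scriptblock]$LineAction = { param($line) $line }
    )
    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader -ArgumentList $InFilePath, $FileEncoding
        $writer = New-Object System.IO.StreamWriter -ArgumentList $OutFilePath, $false, $FileEncoding
        # ReadLine returns $null at end of file
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine([string](& $LineAction $line))
        }
    }
    finally {
        # Dispose flushes and closes the underlying files
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

A call such as Convert-CsvLines -InFilePath 'C:\in.csv' -OutFilePath 'C:\out.csv' -LineAction { param($line) $line -replace '(\d{2}:\d{2}:\d{2})\.\d{3}', '$1' } would strip milliseconds from whole lines; per-field handling would still need the split regex used in the scripts above.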