CSV formatting - strip qualifier from specific fields


Question




I am sorry if this question has been asked before, but I couldn't find anything similar.

I am receiving CSV output that uses " as a text qualifier around every field. I am looking for an elegant solution to reformat these files so that only specific fields (the alphanumeric ones) keep the qualifier.

An example of what I am receiving:

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"

My desired output would be this:

"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25

Any suggestions or assistance are greatly appreciated!

As requested, here are the first five lines of the example file:

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","Recruit-Navy,XL#28-75","13.25","13.25"
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","High Peak-Navy,XL#21-18","36.75","36.75"
"TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384","","10/15/14","","1","Patriot-Red,L#26-35","25.50","25.50"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-Red/Gray,S#23-52","19.75","19.75"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-White/Gray,XL#23-56","19.75","19.75"

Note that this is only an example and not all files will be for Tri-Mountain.

Solution

Since you haven't specified an OS or language, here is a PowerShell version.

I've ditched my previous attempt to work with Import-CSV because of your non-standard CSV files and switched to raw file processing. It should be significantly faster too.

Regex to split CSV is from this question: How to split a string by comma ignoring comma in double quotes
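To make the splitting step concrete, here is a minimal sketch of that regex applied to a single line. The sample line is shortened from the question's data, and the variable names simply mirror the script below; treat it as an illustration rather than a full CSV parser.

```powershell
$DQuotes   = '"'
$Separator = ','

# Split on a separator only when the rest of the line contains balanced
# pairs of double quotes, i.e. when the separator is outside a quoted field.
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

$line   = '"F258273","Recruit-Navy,XL#28-75","13.25"'
$fields = $line -split $SplitRegex

# Each field still carries its quotes at this point;
# the comma inside the second field did not cause a split.
$fields
```

The lookahead only counts quotes, so it assumes fields never contain escaped or unbalanced double quotes.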

Save this script as StripQuotes.ps1. It accepts the following arguments:

  • InPath: folder to read CSVs from. If not specified, the current directory is used.
  • OutPath: folder to save processed CSVs to. It will be created if it does not exist.
  • Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list other valid encodings for your system in the PowerShell console like this: [System.Text.Encoding]::GetEncodings()
  • Verbose: the script will tell you what's going on via Write-Verbose messages.
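As a rough illustration of the Encoding argument, the sketch below lists the encoding names the runtime knows about and resolves one of them the same way the script does. The choice of 'utf-8' is only an example. Note that the set of listed encodings depends on the .NET runtime: newer PowerShell versions built on .NET Core expose fewer code pages by default than Windows PowerShell.

```powershell
# List the encodings available on this system; the Name column holds
# the values that can be passed to the script's -Encoding parameter.
[System.Text.Encoding]::GetEncodings() |
    Sort-Object -Property Name |
    Select-Object -Property Name, CodePage, DisplayName

# Resolve an encoding by name, as the script does for a user-specified value.
$enc = [System.Text.Encoding]::GetEncoding('utf-8')
$enc.WebName
```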

Example (run from the PowerShell console).

Process all CSVs in the folder C:\CSVs_are_here, save processed CSVs to the folder C:\Processed_CSVs, be verbose:

.\StripQuotes.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Verbose

StripQuotes.ps1 script:

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default'
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Matches a single code point in the category "letter".
$AlphaNumRegex = '\p{L}'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                        ForEach-Object {
                            # Strip double quotes, if any
                            $item = $_.Trim($DQuotes)

                            if($_ -match $AlphaNumRegex)
                            {
                                # If field has at least one letter - wrap in quotes
                                $DQuotes + $item + $DQuotes
                            }
                            else
                            {
                                # Else, pass it as is
                                $item
                            }
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }
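
To check the per-line logic without touching any files, the core of the loop can be pulled into a small function and run against the sample line from the question. This is only a sketch: the function name Convert-CsvLine is hypothetical and not part of the script above.

```powershell
# Hypothetical helper (not part of StripQuotes.ps1): applies the same
# split / trim / re-quote steps as the script's inner loop to one line.
function Convert-CsvLine {
    param([string]$Line)

    $DQuotes    = '"'
    $Separator  = ','
    $SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

    $fields = $Line -split $SplitRegex | ForEach-Object {
        # Strip surrounding double quotes, if any
        $item = $_.Trim($DQuotes)

        if ($item -match '\p{L}') {
            # At least one letter: keep the field quoted
            $DQuotes + $item + $DQuotes
        }
        else {
            # No letters: emit the field without quotes
            $item
        }
    }

    $fields -join $Separator
}

$sample = '"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"'
$result = Convert-CsvLine -Line $sample
$result
```

For the sample line this produces the desired output shown in the question. One limitation to keep in mind: a quoted field that contains the separator but no letters (for example "1,5") would lose its quotes and split into two columns, and fields with escaped double quotes are not handled.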
