Handle a CSV file with single-quoted strings and an occasional comma or single quote inside the quoted strings

Problem description

I have a CSV file in which the text columns are wrapped in single quotes and the non-text columns are unquoted. The text columns may contain a comma or a single quote inside the quoted value. I found a script online, but it doesn't handle this situation.

Is there a way to handle this in PowerShell?

Example:

123,678.89,'hello there1', 'xyz1@gmail.com', 'abc,nds'\n 
123,678.89,'hello 'there2', 'xyz2@gmail.com', 'akiu-'nds'\n

Output:

123,678.89|hello there1|xyz1@gmail.com|abc,nds \n 
123,678.89|hello 'there2|xyz2@gmail.com|akiu-'nds \n

Example 2:

123,6272,678.89,,,'hello ,there1',,,,'abc1','tw,es',,'xyz1@gmail.com',,,,,,'abc,nds1'\n
124,8272,928.89,,,,'hello 'there2',,,'abc2','twes',,,'xyz2@gmail.com',,'biej',,,'abc'nds2'\n
125,9272,328.89,,'hello 'there3',,'abc3',', outyi',,,,'xyz3@gmail.com',,,,,,'ahct','abc'nds3'\n

Output:

123|6272|678.89|||hello ,there1||||abc1|tw,es||xyz1@gmail.com||||||abc,nds1\n
124|8272|928.89||||hello 'there2|||abc2|twes|||xyz2@gmail.com||biej|||abc'nds2\n
125|9272|328.89||hello 'there3||abc3|, outyi||||xyz3@gmail.com||||||ahct|abc'nds3\n

Recommended answer

Similar to Kiran's answer. A couple of things need to change, so I don't think there is a one-size-fits-all solution. We need to chain two replacements: the first handles the commas that are actually delimiters, and the second handles the special end-of-line character sequence.

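$path = "c:\temp\file.csv"
$newDelimiter = "|"
# First -replace: turn commas that sit next to a single quote into the new delimiter.
# Second -replace: drop the closing quote before the literal \n at the end of each line.
# Note: Set-Content overwrites the original file in place.
(Get-Content $path) -replace "'\s*?,\s?'|,\s?'|'\s?,",$newDelimiter -replace "'\s*?\\n$","\n" | Set-Content $path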

I have a regex101 link that explains this in more detail. The first regex does most of the work; it has three alternative matches. Because each alternative requires a comma next to a quote, it effectively ignores quotes that stand by themselves. If the data itself contains a quote-and-comma combination, I don't think this could be handled reliably in code without more information.

  • '\s*?,\s?': A comma enclosed by quotes, with optional whitespace on either side of the comma.
  • ,\s?': A comma, optional whitespace, then a quote.
  • '\s?,: A quote, optional whitespace, then a comma.
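To make the three alternatives concrete, here is a minimal sketch that runs only the first regex against one row from the question's first example, with the trailing \n left off. The variable names $line, $pattern and $converted are illustrative and not part of the original answer.

```powershell
# Second row of the question's first example, without the trailing \n.
$line = "123,678.89,'hello 'there2', 'xyz2@gmail.com', 'akiu-'nds'"

# The three alternatives from the answer, joined by regex alternation.
$pattern = "'\s*?,\s?'|,\s?'|'\s?,"

# Show each delimiter match in brackets so the whitespace is visible.
# Prints: [,']  then [', ']  then [', ']
[regex]::Matches($line, $pattern) | ForEach-Object { "[{0}]" -f $_.Value }

# Replace every match with the new delimiter.
$converted = $line -replace $pattern, '|'
$converted
# 123,678.89|hello 'there2|xyz2@gmail.com|akiu-'nds'
```

The stray quotes inside "hello 'there2" and "akiu-'nds" are left alone because no comma is next to them. The closing quote at the end of the row also survives this step; removing it is the job of the second regex.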

So a match of any of the above alternatives is replaced with $newDelimiter. The second regex just looks for '\n at the end of the line, allowing for optional whitespace between the quote and the \n. This is how the last single quote is removed.
