Remove spurious commas


Problem description

An idiot customer is generating CSV files, but one field (a description field) sometimes contains extra commas.

Is there a tidy regex to find these bad records and replace the extra commas with something else? A sed command line would be fine.

Example:

A,B,C,This is a description,D,E
F,G,H,This is a description with a comma (,) in it,D,E

I need a sed command that can tell that there are too many commas in the line and remove the extra commas from field 4.

We do not have the luxury of telling the stupid customer to change their code.

Added

I would not object to a solution that just removes one spurious comma that I have to run multiple times.

Solution

Solution 1: single-line, delete ,

Here you go with a sed one-liner:

sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\1'"$(sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\2/' <<< "$myInput" | sed 's/,//g')"'\3/' <<< "$myInput"

You have to replace <<< "$myInput" with whatever your actual input is.
As you're working with CSVs, you may have to tweak (both occurrences of) the regex to match each line of your CSV sheet.
If your fields are longer than one character, replace [^,] with [^,]* (the last two fields are matched by .+, which already allows any length).
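A minimal sketch of that tweak, assuming a sed that accepts -E (GNU sed treats it the same as -r). The helper name fix_line and the sample record are made up for illustration, and the ^ and $ anchors are an addition that the one-liner above does not have.

```shell
# fix_line is a hypothetical helper name, not part of the original answer.
# It uses [^,]* in place of [^,], so the first three fields may have any length.
# printf | sed is used instead of <<< so the sketch also runs under plain sh.
fix_line() {
  line=$1
  re='^([^,]*,[^,]*,[^,]*,)(.*)(,.+,.+)$'
  # Extract field 4 (capture group 2) and strip every comma from it.
  desc=$(printf '%s\n' "$line" | sed -E "s/$re/\\2/" | sed 's/,//g')
  # Splice the cleaned description back between groups 1 and 3.
  printf '%s\n' "$line" | sed -E "s/$re/\\1${desc}\\3/"
}

fix_line 'AB,CD,EF,a long, comma-laden description,GH,IJ'
# → AB,CD,EF,a long comma-laden description,GH,IJ
```

Because the cleaned text is pasted into a sed expression, this sketch shares a limitation with the one-liner: a description containing / or & would break or distort the replacement.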

Explanation:
We use this regex

/([^,],[^,],[^,],)(.*)(,.+,.+)/

which captures the first part (F,G,H,), the second part (.*, the description) and the last part (,D,E) of the string for us.
The first and third capture groups stay unchanged, while the second is going to be substituted.
For the substitution we call sed a second (and actually a third) time: the second call extracts only the second capture group, and the third replaces every , in that extracted text with nothing, so commas outside the description are left alone.
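The three stages can be run separately to see what each one contributes. This is only a sketch using the bad record from the question; the variable names are illustrative, and -E is assumed to be accepted by your sed (GNU sed treats it like -r).

```shell
# Sample record from the question; variable names are illustrative only.
line='F,G,H,This is a description with a comma (,) in it,D,E'
re='([^,],[^,],[^,],)(.*)(,.+,.+)'

# Stage 1: keep only capture group 2, the description field.
field=$(printf '%s\n' "$line" | sed -E "s/$re/\\2/")
printf 'group 2: %s\n' "$field"

# Stage 2: delete every comma, but only inside that extracted text.
clean=$(printf '%s\n' "$field" | sed 's/,//g')
printf 'cleaned: %s\n' "$clean"

# Stage 3: rebuild the line as group 1 + cleaned text + group 3.
printf '%s\n' "$line" | sed -E "s/$re/\\1${clean}\\3/"
```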

Proof:

Of course, if there is no unwanted comma, nothing gets replaced:
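A quick check with the two sample records from the question might look like the sketch below. The wrapper name strip_commas is hypothetical, and -E is assumed to behave like -r.

```shell
# strip_commas is a hypothetical wrapper around the one-liner from Solution 1.
strip_commas() {
  re='([^,],[^,],[^,],)(.*)(,.+,.+)'
  inner=$(printf '%s\n' "$1" | sed -E "s/$re/\\2/" | sed 's/,//g')
  printf '%s\n' "$1" | sed -E "s/$re/\\1${inner}\\3/"
}

# Clean record: field 4 has no comma, so the line comes back unchanged.
strip_commas 'A,B,C,This is a description,D,E'
# → A,B,C,This is a description,D,E

# Bad record: the comma inside the parentheses is removed.
strip_commas 'F,G,H,This is a description with a comma (,) in it,D,E'
# → F,G,H,This is a description with a comma () in it,D,E
```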


Solution 2: whole file, line-by-line, delete ,

If you want to specify only a file, and the replacement should happen for each line of that file, you can use

while IFS= read -r line; do sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\1'"$(sed -r 's/([^,],[^,],[^,],)(.*)(,.+,.+)/\2/' <<< "$line" | sed 's/,//g')"'\3/' <<< "$line"; done < input.txt

where input.txt at the end is your file.
I just use the sed command from above within a while loop which reads each line of the text. This is necessary because you have to keep track of the line you're reading, as you're calling sed two times on the same input.
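As an alternative sketch (a different technique, not the method above), a single sed invocation can loop with a label and the t command, removing one surplus comma per pass until only five remain. It assumes every record should have exactly six fields with the free-text description in field 4. It needs no shell loop and never pastes field text into a sed expression, so a description containing / or & is handled too. The function name fix_stream is hypothetical.

```shell
# Assumption: six fields per record, free-text description in field 4.
# Each pass deletes one comma that has at least three commas before it and
# at least two after it; `ta` jumps back to label `a` after a successful
# substitution, so the loop stops once only five commas are left.
fix_stream() {
  sed -E -e ':a' \
         -e 's/^(([^,]*,){3}.*),(.*,[^,]*,[^,]*)$/\1\3/' \
         -e 'ta'
}

printf '%s\n' \
  'A,B,C,This is a description,D,E' \
  'F,G,H,This is a description with a comma (,) in it,D,E' \
  'I,J,K,one, two, three/four & five,L,M' | fix_stream
```

To process a file, feed it on standard input, for example fix_stream < input.txt.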


Solution 3: whole file, enclose field in "

As @Łukasz L. pointed out in the comments to the OP, according to RFC 4180, which describes the format for CSV files, it would be better to enclose fields which contain a comma in ".
This is simpler than the other solutions:

sed -r 's/([^,],[^,],[^,],)(.*)(,.*,.*)/\1"\2"\3/' input.txt

Again we have the three capturing groups. This allows us to simply wrap the second group in "!
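The command above wraps field 4 in quotes on every matching line, including records that have no extra comma. A sketch that quotes only the records with surplus commas, and allows fields of any length, could look like this. It assumes six fields per record and no double quotes already present in the data; quote_field4 is a hypothetical name.

```shell
# Assumptions: six fields per record, description in field 4, and no double
# quotes already in the data (CSV quoting would require doubling them).
quote_field4() {
  # The address /^([^,]*,){6,}/ selects only lines with six or more commas,
  # so records that are already well-formed pass through untouched.
  sed -E '/^([^,]*,){6,}/ s/^(([^,]*,){3})(.*)((,[^,]*){2})$/\1"\3"\4/'
}

printf '%s\n' \
  'A,B,C,This is a description,D,E' \
  'F,G,H,This is a description with a comma (,) in it,D,E' | quote_field4
# → A,B,C,This is a description,D,E
# → F,G,H,"This is a description with a comma (,) in it",D,E
```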
