AWK: set multiple delimiters for commas and for quoted text containing commas


Question

I have a CSV file where columns are comma-separated, and columns whose textual data contains commas are quoted.

Sometimes the quoted text itself contains quotes, for example to mean inches, which results in even more quotes.

Textual data without embedded commas is not quoted.

For example:

A,B,C
1,"hello, how are you",hello
2,car,bike
3,13.3 inch tv,"tv 13.3"""

How do I use awk to print the number of columns for each row? For the sample above I should get:

3
3
3

I thought of using $ awk -F'[,"]', but I'm getting way more columns than there are.

Any help is appreciated.

Recommended answer

GNU awk (gawk) has an extension, FPAT, to handle just such problematic CSV files. FPAT is specific to gawk, so the commands below assume that awk is GNU awk. Let's first consider the case without quotes embedded within quotes:

$ awk -v FPAT="([^,]+)|(\"[^\"]+\")" '{print NF}' file.csv
3
3
3

How it works

Instead of defining fields by a separator, FPAT lets us define a field by a regular expression that describes the field's contents. In this case, a field is either a run of characters that contains no commas, ([^,]+), or text surrounded by double quotes, (\"[^\"]+\"). The backslashes are only there to escape the double quotes inside the double-quoted shell string. Note that both alternatives require at least one character, so an empty field (two adjacent commas) is not counted.

For more detail, see the section on defining fields by content (FPAT) in the GNU awk manual.

The revised version of the question adds this line:

3,13.3 inch tv,"tv 13.3"""

In this extended case, double quotes can appear inside a double-quoted field as long as they are themselves doubled. To allow for this, we extend the regex, as per lcd047's suggestion, to accept an arbitrary number of such doubled double quotes within a field:

$ awk -v FPAT="([^,]+)|(\"([^\"]|\"\")+\")" '{print NF}' file.csv
