Ignoring commas within fields using AWK when there are multiple field separators

Problem description

I want to parse CSV records like the one below with awk or gawk.

The fields are separated by commas, but the last field ($6) is special because it actually consists of subfields. These subfields are separated by # (or, to be precise, by ". # "). This in itself is not a problem: I can use awk -F'(,)|(. # )' to set alternative field separators.

However, there are stray commas in this last field as well that need to be ignored.
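To make the problem concrete, here is a minimal sketch using a short, hypothetical record (the variable rec and the function naive_split are illustrative names, not part of the question). It applies the alternative field separator from above and prints each resulting field in brackets:

```shell
# Hypothetical record: two plain columns, then a last column with
# three ". # "-separated subfields, one of which contains a comma.
rec='"a","b","Title. # Foo, Bar. # End."'

# Split on either a comma or ". # ", as in the question.
# (The unescaped "." matches any character, as in the original command.)
naive_split() {
  awk -F'(,)|(. # )' '{ for (i = 1; i <= NF; ++i) printf "%d [%s]\n", i, $i }'
}

printf '%s\n' "$rec" | naive_split
```

The subfield "Foo, Bar" comes out as two separate fields, because the comma inside it is treated like any other separator.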

Is there a way to solve this with awk, perhaps using FPAT?

Sample record:

  "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."

Recommended answer

Using the FPAT feature of gnu-awk (gawk), you can do this. FPAT describes what a field looks like rather than what separates fields; here it matches either a double-quoted field or a run of non-comma characters, so commas inside quotes are left alone. Finally, we split the last field on the regex /\. # /.

s='"http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."'

awk -v FPAT='"[^"]*"|[^,]+' '{
   # print every field except the last one
   for (i=1; i<NF; ++i)
      print i, $i
   # split the last field on /\. # / once, then print each token
   n = split($NF, a, /\. # /)
   for (j=1; j<=n; ++j)
      print i+j-1, a[j]
}' <<< "$s"

1 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab"
2 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002"
3 "EU:C:1985:443"
4 "61984CJ0239"
5 "Gerlach"
6 "Judgment of the Court (Third Chamber) of 24 October 1985
7 Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken
8 Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands
9 Article 41 ECSC - Anti-dumping duties
10 Case 239/84."
