How do I extract a column from a CSV with quoted commas, using the shell?

Question

I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.

foo,bar,baz,quux
11,"first line, second column",13.0,6
210,"second column of second line",23.1,5

(Of course it's longer, and the number of quoted commas per line is not necessarily zero or one, nor is the text predictable.) The text might also have (escaped) double-quotes within double-quotes, or no double-quotes at all for a field that is typically quoted. The only assumption we can make is that there are no quoted newlines, so we can split lines trivially using \n.

Now, I'd like to extract a specific column (say, the third one), e.g. to be printed on standard output, one value per line. I can't simply use commas as field delimiters (and thus, e.g., use cut); rather, I need something more sophisticated. What could that be?

Note: I'm using bash on a Linux system.

Recommended answer

Here is a quick and dirty Python csvcut. The Python csv library already knows everything about various CSV dialects etc., so you just need a thin wrapper.

The first argument should express the index of the field you wish to extract, like

csvcut 3 sample.csv

to extract the third column from the (possibly quoted, etc.) CSV file sample.csv.

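#!/usr/bin/env python3

import csv
import sys

writer = csv.writer(sys.stdout)
# The command line uses 1-based column numbers; Python indexing is zero-based
col = int(sys.argv[1]) - 1
for filename in sys.argv[2:]:
    # newline='' lets the csv module handle line endings itself
    with open(filename, newline='') as handle:
        for row in csv.reader(handle):
            # writerow() expects a sequence of fields, not a bare string
            writer.writerow([row[col]])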
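
As a quick sanity check of the claim that the csv library copes with quoted commas, the sketch below parses the sample from the question in memory. The function name extract_column and the fourth data row (with doubled double-quotes as the escape) are illustrative assumptions, not part of the original answer.

```python
import csv
import io

# Sample from the question, plus a hypothetical extra row containing
# escaped (doubled) double quotes inside a quoted field.
SAMPLE = '''foo,bar,baz,quux
11,"first line, second column",13.0,6
210,"second column of second line",23.1,5
3,"she said ""hi"", then left",7.5,4
'''

def extract_column(text, number):
    """Return the values of the 1-based column `number`, one per row."""
    reader = csv.reader(io.StringIO(text, newline=''))
    return [row[number - 1] for row in reader]

if __name__ == "__main__":
    # Print the third column, one value per line.
    for value in extract_column(SAMPLE, 3):
        print(value)
```

Note that this prints the raw field values; the script above instead goes through csv.writer, which would re-quote any value that itself contains a comma.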

To do: error handling, extraction of multiple columns. (Not hard per se; use row[2:5] to extract columns 3, 4, and 5; but I'm too lazy to write a proper command-line argument parser.)
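
For the multi-column case, one possible sketch is shown here. It assumes a column specification loosely modeled on cut -f (1-based numbers, commas, and inclusive ranges such as "1,3-5"); the helper names parse_columns and cut_columns are hypothetical, and error handling is still left out.

```python
import csv
import io

def parse_columns(spec):
    """Turn a 1-based spec such as "1,3-5" into zero-based indices."""
    indices = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        start = int(first)
        stop = int(last) if last else start
        # Ranges are inclusive on the command line, so stop is kept as-is.
        indices.extend(range(start - 1, stop))
    return indices

def cut_columns(lines, spec):
    """Yield the selected columns of each CSV row as a list."""
    indices = parse_columns(spec)
    for row in csv.reader(lines):
        yield [row[i] for i in indices]

if __name__ == "__main__":
    sample = io.StringIO('11,"first line, second column",13.0,6\n', newline='')
    for fields in cut_columns(sample, "2-3"):
        print(fields)
```

Any iterable of lines works as input to cut_columns, so an open file handle can be passed in place of the in-memory sample.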
