从制表符分隔的文件中提取列 [英] Extract columns from tab separated file

查看:93
本文介绍了从制表符分隔的文件中提取列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个文件(data.rdb),格式如下:

I have a file (data.rdb) with the following format:

col1    col2    col3    col4    col5    col6    col7
aaa1    bbb1    ccc1    ddd1    eee1    fff1    ggg1
aaa2    bbb2    ccc2    ddd2    eee2    fff2    ggg2
aaa3    bbb3    ccc3    ddd3    eee3    fff3    ggg3

数据的某些属性:

  • 所有列都用制表符分隔
  • 列的宽度不同
  • 单元格的长度可能不同
  • 文件中的列将比显示的要多得多,并需要几百行
  • 我提供的列名只是通用的,真实名称可以是任何单词,没有制表符,空格或特殊字符.

我需要使用bash按名称提取某些列,例如col1col3col6,其中要选择的列来自定义为COLUMN_LIST=$@的shell变量,其中$@是参数传递给我的shell脚本.每次我调用脚本时,参数的数量和名称可能会更改.

I need to extract some of the columns by name using bash, e.g., col1, col3 and col6, where the columns to select come from a shell variable defined as COLUMN_LIST=$@ where $@ are the parameters passed to my shell script. The number and name of the parameters might change each time I call the script.

该脚本必须为bash,不能为python或类似版本.

The script needs to be in bash, cannot be python or similar.

有什么想法吗?我考虑过使用awk/gawk,但是我不知道如何通过列名进行选择.列顺序可能因文件而异.

Any ideas? I thought about using awk/gawk, but I do not know how to select by column name. The column order might change from file to file.

谢谢 豪尔赫

更新

由于某种原因,这些解决方案似乎都无法在我的真实数据文件上运行(即,我完全没有输出),因此我发布了其中一个的子集:

for some reason, none of these solutions seem to work on my real data files (i.e., I get no output at all), so I am posting a subset of one of those:

date    star    jdb texp
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  HD217987 2400000.23551544   900.
2013-11-22  TOI-134  2400000.23551544   900.
2013-11-22  tauCet   2400000.23551544   60. 
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.

在这种情况下,我会对列star jdbtexp

in this case, I would be interested in columns star jdb and texp

更新2

我使用了@EdMorton的代码,这是结果:

I've used @EdMorton's code and this is the result:

date    star    jdb texp    date    star    jdb texp
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  HD217987 2400000.23551544   900.    2013-11-22  HD217987 2400000.23551544   900.
2013-11-22  TOI-134  2400000.23551544   900.    2013-11-22  TOI-134  2400000.23551544   900.
2013-11-22  tauCet   2400000.23551544   60.     2013-11-22  tauCet   2400000.23551544   60. 
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.

更新3

我最终使用了EdMorton的awk版本-主要是为了输出的灵活性-但经过修改后,我不希望它输出错误的列:

I ended up using EdMorton's version of awk -- mainly for flexibility on the output -- but with the modification that I do not want it to output wrong columns:

BEGIN {
    numCols = split(column_list,cols)
    OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal  = (colName in f ? $(f[colName]) : "")
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}

我得到的主要问题是标题行没有制表符分隔,因此列细分无法正常工作.识别制表符/非制表符的简单方法:

The main issue I got was that the header line was not tab separated and as such column breakdown did not work. An easy way to spot tab/non-tab characters:

tr $'\t' '#' < data.rdb | head -2

给出了我的一个测试文件:

which gave on one of my test files:

date    star    jdb texp
2013-11-22#epsInd#2400000.23551544#100.

推荐答案

处理此问题的最佳方法是创建一个将列标题字符串(即字段名称)映射到该字段的数组(下面的f[])读取标题行时输入数字,然后从此开始按其名称访问字段.

The best way to deal with this is to create an array (f[] below) that maps the column header strings (i.e. the field names) to the field numbers when reading the header line and then just access the fields by their names from then on.

已更新,以防止调用方要求输入不存在的列名和DOS行尾:

Updated to protect against the caller asking for a column name that doesn't exist and against DOS line endings:

$ cat tst.awk
BEGIN {
    numCols = split(column_list,cols)
    FS=OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal  = (colName in f ? $(f[colName]) : (NR>1 ? "N/A" : colName))
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}

$ awk -v column_list='col1 col3 col6' -f tst.awk data.rdb
col1    col3    col6
aaa1    ccc1    fff1
aaa2    ccc2    fff2
aaa3    ccc3    fff3

$ awk -v column_list='col1 col3 col6 bob' -f tst.awk data.rdb
col1    col3    col6    bob
aaa1    ccc1    fff1    N/A
aaa2    ccc2    fff2    N/A
aaa3    ccc3    fff3    N/A

请注意,采用上述方法,您可以根据需要更改列的顺序,而不仅仅是按原始顺序打印它们:

Note that with the above approach if you like you can change the order of the columns for output, not just print them in their original order:

$ awk -v column_list='col5 col2 col4' -f tst.awk data.rdb
col5    col2    col4
eee1    bbb1    ddd1
eee2    bbb2    ddd2
eee3    bbb3    ddd3

这篇关于从制表符分隔的文件中提取列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆