Combining certain columns of several tab-delimited files based on first column


Problem description

The 1st column in each inFile contains a string that is not necessarily present in all inFiles.

The 2nd and 7th columns in each inFile contain the Title# strings.

Using AWK, I cannot piece this together correctly. I hope my descriptive variable names help clarify what I'm trying to do. These are the components I think I need:

  1. tab-separated input files: -F'\t'
  2. count the strings in the 1st column, but add each 'name' only once to '1stColumnNames': !1stColumnNames[$1]++ { name[++i] = $1 }
  3. make a new index for each .tsv file, so that values stored for one file do not overwrite those of another: !r[FILENAME]++ { ++argind }
  4. store the corresponding values from the 2nd and 7th columns of each file: { 2ndColumnVals[$1, argind] = $2 } { 7thColumnVals[$1, argind] = $7 }
  5. print all 1stColumnNames with their associated 2ndColumnVals and 7thColumnVals, including the headers 'Title1' 'Title2' 'Title3' etc.: ?????
  6. where a particular 2ndColumnVals or 7thColumnVals entry is empty, print Mtee instead: ?????
  7. do this for all .tsv files in the current working directory and output a new tsv file: *.tsv > outFile.tsv
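Items 2 and 3 above rely on two common awk idioms: `!seen[$1]++` is true only the first time a key appears, and `FNR == 1` fires once at the start of each input file. The following is a minimal sketch of just those two pieces, not a full solution. The file names `a.tsv` and `b.tsv` and the variable names `seen`, `names`, `nfiles` and `summary` are made up for illustration. Note that awk identifiers cannot start with a digit, so names such as `1stColumnNames` would need to be renamed.

```shell
# Hypothetical sample data: two small tab-separated files in a temp directory.
tmp=$(mktemp -d)
printf 'Names\tTitle1\nAAAA\t1111\nBBBBB\t1111\n' > "$tmp/a.tsv"
printf 'Names\tTitle3\nBBBBB\t2222\nDDDDD\t2222\n' > "$tmp/b.tsv"

summary=$(awk -F'\t' '
FNR == 1    { nfiles++ }                 # first line of each file: new file index
!seen[$1]++ { names[++n] = $1 }          # true only the first time a key appears
END {
    # unique keys in first-seen order, then the number of files read
    for (i = 1; i <= n; i++) printf "%s%s", names[i], (i < n ? " " : "")
    printf "\n%d\n", nfiles
}' "$tmp/a.tsv" "$tmp/b.tsv")

echo "$summary"
rm -rf "$tmp"
```

The header string `Names` is collected like any other key, which is what lets the header row come out merged along with the data rows.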

Example files

inFile1.tsv

Names   Title1  Title2
AAAA    1111    123456
BBBBB   1111    123456
CCC     1111    123456

inFile2.tsv

Names   Title3  Title4
BBBBB   2222    789456
DDDDD   2222    789456
EEEE    2222    789456

inFile3.tsv

Names   Title5  Title6
AAAA    3333    987654
CCC     3333    987654
EEEE    3333    987654

outFile123.tsv

Names   Title1  Title2  Title3  Title4  Title5  Title6
AAAA    1111    123456  Mtee    Mtee    3333    987654  
BBBBB   1111    123456  2222    789456  Mtee    Mtee
CCC     1111    123456  Mtee    Mtee    3333    987654
DDDDD   Mtee    Mtee    2222    789456  Mtee    Mtee
EEEE    Mtee    Mtee    2222    789456  3333    987654







GNU Awk 4.0.1 is located at /usr/bin/awk, so I made this file and executed it in the same working directory where the 3 input files are located:

#### Example Usage:  script1.sh inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv

awk -F'\t' '
FNR==1 { ++numFiles}
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1,numFiles] = $2 FS $3 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (fileNr=1;fileNr<=numFiles;fileNr++) {
            printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "Mtee\tMtee")
        }
        print ""
    }
}
' "$@"

Running awk -F script1.awk inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv prints the following error messages:

awk: cmd. line:1: inFile1.tsv

awk: cmd. line:1: ^ syntax error
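This message is consistent with the upper-case `-F` having taken `script1.awk` as the field separator, which leaves `inFile1.tsv` to be parsed as the program text. The lower-case `-f` is the option that reads a program from a file. A minimal sketch of the difference, using hypothetical files `prog.awk` and `in.tsv`:

```shell
tmp=$(mktemp -d)
printf 'k\tv\n' > "$tmp/in.tsv"
printf '{ print $2 }\n' > "$tmp/prog.awk"

# -F sets the field separator; -f reads the awk program from a file.
out=$(awk -F'\t' -f "$tmp/prog.awk" "$tmp/in.tsv")
echo "$out"    # second tab-separated field of the only line

rm -rf "$tmp"
```

Separately, the first script as shown is a shell wrapper around an awk command (it ends with `"$@"`), so it would presumably be run as a shell script, for example `sh script1.sh inFile1.tsv inFile2.tsv inFile3.tsv`, rather than passed to `awk -f`.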







#!/usr/bin/awk -f
#### named as script2.awk
#### Example Usage:  awk -f script2.awk inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv

BEGIN { FS = "\t" } #input field separator (FS) is a tab
{ sub(/\r/, "") }   #remove the first carriage return on each line (DOS line endings)
!f[FILENAME]++ { ++indx }   #increment the file counter indx the first time each input file is seen
!a[$1]++ { keys[i++] = $1 } #append each column-1 string to keys the first time it is seen
{ b[$1, indx] = $2 FS $3 }  #store the tab-joined 2nd and 3rd columns in b, indexed by column-1 string and file counter
END {
    for (i = 0; i in keys; ++i) {   #????
        key = keys[i]   #????
        printf "%s", key   #print the current column-1 string (key is a scalar; keys is the array)
        for (j = 1; j <= indx; ++j) {   #????
            v = b[key, j]   #????
            printf "\t%s", length(v) ? v : "Mtee" FS "Mtee" #print the stored pair, or two tab-separated Mtee placeholders if nothing was stored
        }
        print ""    #????
    }
}

Recommended answer

Here is another awk script:

#!/usr/bin/awk -f
# Set field separator to tab (\t)
BEGIN { FS = "\t" }
# Remove carriage return characters if file is in DOS format (CRLF)
{ sub(/\r/, "") }
# Increment indx by 1 (starts at 0) every time a new file is processed
!f[FILENAME]++ { ++indx }
# Add a key ($1) to the keys array the first time it is encountered
!a[$1]++ { keys[i++] = $1 }
# Store the 2nd and 3rd fields in array b, indexed by key and file number
{ b[$1, indx] = $2 FS $3 }
# This block runs after all files are processed
END {
    # Traverse the keys in order
    for (i = 0; i in keys; ++i) {
        key = keys[i]
        # Print key
        printf "%s", key
        # Print columns from every file in order
        for (j = 1; j <= indx; ++j) {
            v = b[key, j]
            printf "\t%s", length(v) ? v : "Mtee" FS "Mtee"
        }
        # End the line with a newline
        print ""
    }
}

Usage:

awk -f script.awk file1 file2 file3

Output:

Names   Title1  Title2  Title3  Title4  Title5  Title6
AAAA    1111    123456  Mtee    Mtee    3333    987654
BBBBB   1111    123456  2222    789456  Mtee    Mtee
CCC     1111    123456  Mtee    Mtee    3333    987654
DDDDD   Mtee    Mtee    2222    789456  Mtee    Mtee
EEEE    Mtee    Mtee    2222    789456  3333    987654
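The question text mentions the 2nd and 7th columns, while the sample files and the script use the 2nd and 3rd. One way to cover both cases is to pass the column numbers in as awk variables. This is a sketch only, assuming POSIX awk. The variable names `c1`, `c2`, `filler`, `val` and `merged` and the two sample files are introduced here for illustration and are not part of the original answer.

```shell
tmp=$(mktemp -d)
printf 'Names\tTitle1\tTitle2\nAAAA\t1111\t123456\nBBBBB\t1111\t123456\n' > "$tmp/in1.tsv"
printf 'Names\tTitle3\tTitle4\nBBBBB\t2222\t789456\nDDDDD\t2222\t789456\n' > "$tmp/in2.tsv"

# c1 and c2 select the columns to keep; set c2=7 for files that have 7+ columns.
merged=$(awk -F'\t' -v c1=2 -v c2=3 -v filler=Mtee '
{ sub(/\r$/, "") }                           # tolerate CRLF line endings
FNR == 1 { nfiles++ }                        # new input file
!seen[$1]++ { keys[++n] = $1 }               # remember first-seen key order
{ val[$1, nfiles] = $c1 FS $c2 }             # the two chosen columns for this key and file
END {
    for (i = 1; i <= n; i++) {
        printf "%s", keys[i]
        for (j = 1; j <= nfiles; j++)
            printf "\t%s", (((keys[i], j) in val) ? val[keys[i], j] : filler FS filler)
        print ""
    }
}' "$tmp/in1.tsv" "$tmp/in2.tsv")

echo "$merged"
rm -rf "$tmp"
```

Testing membership with `(key, j) in val` distinguishes a key that is missing from a file from one whose stored value happens to be empty, and it avoids creating empty array entries as a side effect of the lookup.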
