What is the simplest method to join columns from a variable number of files using a bash script?

Problem description

I have input files in one directory. All the input files have the same format, and I'd like to join certain columns from these input files into one output file.

For example:

In File1:

Adam    0.5 a1
Bills   0.7 b1
Carol   0.8 c1
Dean    0.4 d1

In File2:

Adam    0.4 a2
Carol   0.8 c2
Evan    0.9 e2

In File3:

Bills   0.6 b3
Carol   0.7 c3
Evan    0.1 e3

I'd like to join the third column from all input files, using the first column as the key. The output should look like this:

Adam    a1  a2  NA
Bills   b1  NA  b3
Carol   c1  c2  c3
Dean    d1  NA  NA
Evan    NA  e2  e3

Because the number of input files varies, the number of columns in the output varies too. There are at least 200 input files and at most 10,000.

I couldn't find a simple way to solve this with 'for', 'awk', 'join', or 'cut'. And yes, I can write a Python or Perl script to solve it, but I wonder whether it can be done with a bash script alone?

P.S. I searched for a solution before asking this question but couldn't find one. If this kind of question has already been asked, please point me to the answer.

Recommended answer

You can do this by combining two joins.

$ join -o '0,1.3,2.3' -a1 -a2 -e 'NA' file1 file2
Adam a1 a2
Bills b1 NA
Carol c1 c2
Dean d1 NA
Evan NA e2

First join the first two files together, using -a1 -a2 to make sure lines that are present in only one file are still printed. -o '0,1.3,2.3' controls which fields are output (0 is the join field, 1.3 is field 3 of the first file, 2.3 is field 3 of the second file), and -e 'NA' replaces missing fields with NA.
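One thing the command above takes for granted is that join expects both inputs to be sorted on the join field. The sample files happen to be sorted already. The following is a minimal sketch, assuming unsorted copies of File1 and File2 written to a temporary directory (the directory and the `.sorted` file names are made up for illustration), that sorts each input before joining:

```shell
#!/bin/bash
# Sketch: the same two-file join, with each input sorted on column 1 first.
export LC_ALL=C   # make sort and join agree on the collation order

tmpdir=$(mktemp -d)   # scratch directory, hypothetical sample data below
printf '%s\n' 'Dean 0.4 d1' 'Adam 0.5 a1' 'Carol 0.8 c1' 'Bills 0.7 b1' > "$tmpdir/file1"
printf '%s\n' 'Evan 0.9 e2' 'Carol 0.8 c2' 'Adam 0.4 a2' > "$tmpdir/file2"

# Sort on the key column only.
sort -k1,1 "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort -k1,1 "$tmpdir/file2" > "$tmpdir/file2.sorted"

# Full outer join on column 1, keeping column 3 of each file.
result=$(join -o '0,1.3,2.3' -a1 -a2 -e 'NA' \
    "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
echo "$result"

rm -rf "$tmpdir"
```

Setting LC_ALL=C for both commands avoids the case where sort and join disagree about ordering under a locale-specific collation.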

$ join -o '0,1.3,2.3' -a1 -a2 -e 'NA' file1 file2 | join -o '0,1.2,1.3,2.3' -a1 -a2 -e 'NA' - file3
Adam a1 a2 NA
Bills b1 NA b3
Carol c1 c2 c3
Dean d1 NA NA
Evan NA e2 e3

Then pipe that join into a second one, which joins the third file. The trick here is passing - as the first file name, which tells join to read its first input from stdin. The output list becomes '0,1.2,1.3,2.3' because the piped input already carries two value columns.

For an arbitrary number of files, here's a script which applies this idea recursively.

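#!/bin/bash

# Print the key column followed by column 3 of every file given as an argument.
# Each input file must already be sorted on its first column, as join requires.
join_all() {
    local file=$1
    shift

    # Reduce the current file to "key value", then merge it with the rest.
    awk '{print $1, $3}' "$file" | {
        if (($# > 0)); then
            # The recursive call emits the key plus $# value columns,
            # so its last field is number $# + 1.
            join2 - <(join_all "$@") $(($# + 1))
        else
            cat
        fi
    }
}

# Full outer join of file1 (key + 1 value) with file2 (key + count-1 values).
join2() {
    local file1=$1
    local file2=$2
    local count=$3

    # Build the output list 2.2 2.3 ... 2.count. eval is needed because
    # brace expansion happens before $count is substituted.
    local fields=$(eval echo 2.{2..$count})
    join -a1 -a2 -e 'NA' -o "0 1.2 $fields" "$file1" "$file2"
}

join_all "$@"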
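The least obvious line in the script is the one in join2 that builds the -o field list with eval and brace expansion. As a sketch of the same idea without eval, the list can be assembled in a plain loop. The helper name `field_list` is hypothetical and not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper: print "2.2 2.3 ... 2.COUNT", the list of fields
# to take from the second input of join.
field_list() {
    count=$1
    i=2
    fields=
    while [ "$i" -le "$count" ]; do
        # Append with a separating space, except before the first entry.
        fields="${fields:+$fields }2.$i"
        i=$((i + 1))
    done
    printf '%s\n' "$fields"
}

field_list 4   # prints: 2.2 2.3 2.4
```

Inside join2, `$(field_list "$count")` could stand in for the eval expression. The result is the same string, so which form to use is mostly a matter of taste.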

Example usage:

$ ./joinall file1
Adam a1
Bills b1
Carol c1
Dean d1

$ ./joinall file1 file2
Adam a1 a2
Bills b1 NA
Carol c1 c2
Dean d1 NA
Evan NA e2

$ ./joinall file1 file2 file3
Adam a1 a2 NA
Bills b1 NA b3
Carol c1 c2 c3
Dean d1 NA NA
Evan NA e2 e3
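The recursive script starts one join process per input file. For the thousands of files mentioned in the question, a different technique may be worth considering: a single awk pass that collects column 3 from every file into an array keyed by column 1. This is not the answer's method, only a sketch under a few assumptions: the function name `merge_columns` and the sample data are made up, all data must fit in memory, and an empty input file would not get a column of its own.

```shell
#!/bin/bash
# Sketch: merge column 3 of many files by column 1 in one awk process.
export LC_ALL=C

merge_columns() {
    awk '
        FNR == 1 { nfiles++ }   # first line of each (non-empty) input file
        {
            # Remember each key the first time it appears.
            if (!($1 in seen)) { seen[$1] = 1; keys[++nkeys] = $1 }
            value[$1, nfiles] = $3
        }
        END {
            for (k = 1; k <= nkeys; k++) {
                line = keys[k]
                for (f = 1; f <= nfiles; f++)
                    line = line " " (((keys[k], f) in value) ? value[keys[k], f] : "NA")
                print line
            }
        }
    ' "$@" | sort -k1,1   # keys come out in first-seen order, so sort them
}

# Hypothetical sample data matching File1..File3 from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'Adam 0.5 a1' 'Bills 0.7 b1' 'Carol 0.8 c1' 'Dean 0.4 d1' > "$tmpdir/file1"
printf '%s\n' 'Adam 0.4 a2' 'Carol 0.8 c2' 'Evan 0.9 e2' > "$tmpdir/file2"
printf '%s\n' 'Bills 0.6 b3' 'Carol 0.7 c3' 'Evan 0.1 e3' > "$tmpdir/file3"

merged=$(merge_columns "$tmpdir/file1" "$tmpdir/file2" "$tmpdir/file3")
echo "$merged"
rm -rf "$tmpdir"
```

Unlike join, this approach does not require the inputs to be sorted, at the cost of holding every value in memory until the END block runs.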
