What is the simplest method to join columns from a variable number of files using a bash script?

Question

I have input files in one directory. All the input files have the same format and I'd like to join certain columns from these input files into one output file.

For example:

In file 1:

Adam    0.5 a1
Bills   0.7 b1
Carol   0.8 c1
Dean    0.4 d1

In file 2:

Adam    0.4 a2
Carol   0.8 c2
Evan    0.9 e2

In file 3:

Bills   0.6 b3
Carol   0.7 c3
Evan    0.1 e3

I'd like to join the third column from all input files by using the first column as a key. So the output may look like

Adam    a1  a2  NA
Bills   b1  NA  b3
Carol   c1  c2  c3
Dean    d1  NA  NA
Evan    NA  e2  e3

Because the number of input files varies, the number of columns in the output varies too. There are at least 200 input files and there can be as many as 10,000.

I couldn't find a simple way to use 'for', 'awk', 'join', 'cut' to solve this problem. And yes, I can write a Python or Perl script to solve this problem but I wonder if this can be done using bash script alone?

ps. I tried to search for a solution before asking this question but couldn't find it. If this kind of question is already asked, please point me to the answer.

Recommended answer

You can do this by combining two joins.

$ join -o '0,1.3,2.3' -a1 -a2 -e 'NA' file1 file2
Adam a1 a2
Bills b1 NA
Carol c1 c2
Dean d1 NA
Evan NA e2

First join the first two files together, using -a1 -a2 to make sure lines that are only present in one file are still printed. -o '0,1.3,2.3' controls which fields are output and -e 'NA' replaces missing fields with NA.
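One caveat worth keeping in mind: join expects both inputs to be sorted on the join field. The sample files already are, but real data may not be. As a minimal sketch (join_col3 is a hypothetical helper name, not part of the original answer), the same command can be wrapped so that each input is sorted on its first column beforehand:

```bash
# Hypothetical helper: outer-join column 3 of two files on column 1.
# Each input is sorted on its first field first, because join
# expects sorted input.
join_col3() {
    local tmp1 tmp2
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || return 1

    sort -k1,1 "$1" > "$tmp1"
    sort -k1,1 "$2" > "$tmp2"

    # Same flags as above: key, then column 3 of each file, NA for gaps.
    join -o '0,1.3,2.3' -a1 -a2 -e 'NA' "$tmp1" "$tmp2"

    rm -f "$tmp1" "$tmp2"
}
```

Calling join_col3 file1 file2 should give the same five lines as the command above, even if the rows of either file are shuffled.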

$ join -o '0,1.3,2.3' -a1 -a2 -e 'NA' file1 file2 | join -o '0,1.2,1.3,2.3' -a1 -a2 -e 'NA' - file3
Adam a1 a2 NA
Bills b1 NA b3
Carol c1 c2 c3
Dean d1 NA NA
Evan NA e2 e3

Then pipe that join to another one which joins the third file. The trick here is passing in - as the first file name, which tells join to use stdin as the first file.

For an arbitrary number of files, here's a script which applies this idea recursively.

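#!/bin/bash

# Recursively outer-join column 3 of every file given, keyed on column 1.
join_all() {
    local file=$1
    shift

    # Reduce the current file to "key value", then join it with the rest.
    awk '{print $1, $3}' "$file" | {
        if (($# > 0)); then
            # The joined remainder has $# value columns: fields 2..($# + 1).
            join2 - <(join_all "$@") $(($# + 1))
        else
            cat
        fi
    }
}

# Outer-join file1 (key + one value) with file2 (key + count-1 values).
join2() {
    local file1=$1
    local file2=$2
    local count=$3

    # Build the -o entries 2.2 ... 2.count for the right-hand file.
    local fields=$(eval echo 2.{2..$count})
    join -a1 -a2 -e 'NA' -o "0 1.2 $fields" "$file1" "$file2"
}

join_all "$@"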
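A note on the eval in join2: bash performs brace expansion before variable expansion, so a plain 2.{2..$count} would not expand into a list. The eval forces a second pass. If you would rather avoid eval, the same field list can be built with a loop. This is only a sketch, and field_list is a hypothetical name that does not appear in the script above:

```bash
# Hypothetical eval-free way to build "2.2 2.3 ... 2.N" for join -o.
field_list() {
    local count=$1
    local i=2
    local out=
    while [ "$i" -le "$count" ]; do
        # Append "2.i", separated by a space once out is non-empty.
        out="$out${out:+ }2.$i"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

field_list 4   # prints: 2.2 2.3 2.4
```

Inside join2, the assignment to fields could then read fields=$(field_list "$count") with no other changes.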

Example usage:

$ ./joinall file1
Adam a1
Bills b1
Carol c1
Dean d1

$ ./joinall file1 file2
Adam a1 a2
Bills b1 NA
Carol c1 c2
Dean d1 NA
Evan NA e2

$ ./joinall file1 file2 file3
Adam a1 a2 NA
Bills b1 NA b3
Carol c1 c2 c3
Dean d1 NA NA
Evan NA e2 e3
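With thousands of input files, the recursive script keeps one join process and one process substitution alive per file, which may run into process or file-descriptor limits. A possible alternative, sketched here under that assumption and not taken from the original answer, is to fold the files into a temporary accumulator one at a time. The name join_all_iter and the use of mktemp are illustrative choices:

```bash
# Hypothetical iterative variant: join one file at a time into an
# accumulator file instead of nesting one process per input.
join_all_iter() {
    local acc next file fields i
    local n=1
    acc=$(mktemp) || return 1
    next=$(mktemp) || return 1

    # Seed the accumulator with key + column 3 of the first file.
    awk '{print $1, $3}' "$1" | sort -k1,1 > "$acc"
    shift

    for file in "$@"; do
        n=$((n + 1))
        # -o list: key, the n-1 value columns so far, then the new one.
        fields=0
        i=2
        while [ "$i" -le "$n" ]; do
            fields="$fields,1.$i"
            i=$((i + 1))
        done
        fields="$fields,2.2"

        # "-" makes join read the reduced, sorted file from stdin.
        awk '{print $1, $3}' "$file" | sort -k1,1 |
            join -a1 -a2 -e 'NA' -o "$fields" "$acc" - > "$next"
        mv "$next" "$acc"
    done

    cat "$acc"
    rm -f "$acc" "$next"
}
```

Called as join_all_iter file1 file2 file3, this should print the same five lines as the last example above. It trades the pipeline for one temporary file that is rewritten once per input.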
