bash: clean outer join of three files, preserving file-membership
Consider the following three files with headers in the first row:
file1:
id name in1
1 jon 1
2 sue 1
file2:
id name in2
2 sue 1
3 bob 1
file3:
id name in3
2 sue 1
3 adam 1
I want to merge these files to get the following output, merged_files:
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.
This is very similar to the problem solved in Bash script to find matching rows from multiple CSV files (http://stackoverflow.com/questions/17459789/bash-script-to-find-matching-rows-from-multiple-csv-files-and-create-report-out/17460732#17460732). It's not identical, but it is very similar. (So similar that I only had to remove three `sort` commands, change the three `sed` commands slightly, change the file names, change the 'missing' value from `no` to `0`, and change the replacement in the final `sed` from comma to space.)
The `join` and `sed` commands (and usually `sort` too, but the data is already sufficiently sorted) are the primary tools needed. Assume that `:` does not appear in the original data. To record the presence of a row in a file, we want a `1` field in the file (it's almost there); we'll have `join` supply the `0` when there isn't a match. The ` 1` at the end of each non-heading line needs to become `:1`, and the last field in the heading also needs to be preceded by a `:`. Then, using `bash`'s process substitution, we can write:
$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$
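To see what the marking step does in isolation, here is a minimal sketch that recreates file1 from the question in a scratch directory and runs the same `sed` command on it. The scratch directory and the `workdir` and `marked` names are assumptions made for illustration only.

```shell
# Sketch: recreate file1 from the question in a scratch directory
# (the directory and variable names are illustrative assumptions).
workdir=$(mktemp -d)
printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1' > "$workdir/file1"

# Replace the last space on each line with a colon, so that everything
# before the colon ("id name") can serve as a single join field.
marked=$(sed 's/[ ]\([^ ]*\)$/:\1/' "$workdir/file1")
printf '%s\n' "$marked"
# Prints:
# id name:in1
# 1 jon:1
# 2 sue:1
```

After this step each line has exactly two colon-separated fields: the combined id and name, and the membership flag.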
The `sed` command (three times) adds the `:` before the last field in each line of the files. The joins are very nearly symmetric. The `-t:` specifies that the field separator is the colon; the `-a 1` and `-a 2` mean that a line with no match in the other file is still included in the output; the `-e 0` means that a field missing because of such a non-match is output as `0`; and the `-o` option specifies the output columns. For the first join, `-o 0,1.2,2.2` outputs the join column (`0`), then the second column (the `1`) from each of the two files. The first input of the second join has 3 columns, so it specifies `-o 0,1.2,1.3,2.2`. The argument `-` on its own means 'read standard input'. The `<(...)` notation is 'process substitution', where a file name (usually `/dev/fd/NN`) is provided to the `join` command, and it contains the output of the command inside the parentheses. The output is then filtered through `sed` once more to replace the colons with spaces, yielding the desired output.
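The effect of the two `-o` lists is easier to follow when the joins are run one at a time. The sketch below does that with the data from the question. It writes the marked files to temporary files instead of using process substitution, so it should also run in a plain POSIX `sh`. The scratch directory and the `marked1` to `marked3`, `step1`, and `step2` file names are assumptions for illustration.

```shell
# Sketch: run the two joins separately to inspect the intermediate result.
# File contents come from the question; the scratch directory and the
# marked*/step* file names are illustrative assumptions.
d=$(mktemp -d)
printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1'  > "$d/file1"
printf '%s\n' 'id name in2' '2 sue 1' '3 bob 1'  > "$d/file2"
printf '%s\n' 'id name in3' '2 sue 1' '3 adam 1' > "$d/file3"

# Mark the last field of each file with a colon. Temporary files stand in
# for the process substitution used in the answer.
for n in 1 2 3; do
    sed 's/[ ]\([^ ]*\)$/:\1/' "$d/file$n" > "$d/marked$n"
done

# First join: join field (0), flag from file1 (1.2), flag from file2 (2.2).
join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 "$d/marked1" "$d/marked2" > "$d/step1"
cat "$d/step1"
# id name:in1:in2
# 1 jon:1:0
# 2 sue:1:1
# 3 bob:0:1

# Second join: step1 has three fields, so both of its flags (1.2 and 1.3)
# are kept, followed by the flag from file3 (2.2).
join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 "$d/step1" "$d/marked3" > "$d/step2"
cat "$d/step2"
# id name:in1:in2:in3
# 1 jon:1:0:0
# 2 sue:1:1:1
# 3 adam:0:0:1
# 3 bob:0:1:0
```

Replacing the colons in `step2` with spaces gives the merged output shown above. The same pattern would extend to a fourth file by adding one more join whose `-o` list names `1.2,1.3,1.4,2.2`.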
The only difference from the desired output is the sequencing of `3 bob` after `3 adam`; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as `sort -k1,1 -k3,5r`, where the `r` reverses the flag columns so that `3 bob` precedes `3 adam`, except that this sorts the label line after the data; there are workarounds for that if necessary).
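One possible workaround for the label line, sketched under assumptions: read the header separately, print it unchanged, and sort only the remaining rows. The reverse flag on the second key is a guess at the ordering the question intends (rows with the same id ordered by their flags, highest first), and the `merged`, `header`, and `sorted` names are illustrative.

```shell
# Sketch: reorder the data rows while keeping the header line first.
# The input is the merged output from the answer; the reverse (r) flag on
# the second key is an assumption about the ordering the question wants.
merged='id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0'

sorted=$(printf '%s\n' "$merged" | {
    IFS= read -r header           # consume only the header line
    printf '%s\n' "$header"       # emit it unchanged
    LC_ALL=C sort -k1,1n -k3,5r   # sort the remaining rows
})
printf '%s\n' "$sorted"
# Prints:
# id name in1 in2 in3
# 1 jon 1 0 0
# 2 sue 1 1 1
# 3 bob 0 1 0
# 3 adam 0 0 1
```

Using `read` rather than `head -n 1` avoids the risk of the first command consuming more than one line from the pipe before `sort` gets to read the rest.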