bash: clean outer join of three files, preserving file-membership


Problem description

Consider the following three files with headers in the first row:

file1:

id name in1
1 jon 1
2 sue 1

file2:

id name in2
2 sue 1
3 bob 1

file3:

id name in3
2 sue 1
3 adam 1

I want to merge these files to get the following output, merged_files:

id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1

This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.

Solution

This is very similar to the problem solved in "Bash script to find matching rows from multiple CSV files" (http://stackoverflow.com/questions/17459789/bash-script-to-find-matching-rows-from-multiple-csv-files-and-create-report-out/17460732#17460732). It is not identical, but it is close enough that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from comma to space.

The join command and sed (usually sort too, but the data is already sufficiently sorted) are the primary tools needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we want a 1 field in the file (it's almost there); we'll have join supply the 0 when there isn't a match. The 1 at the end of each non-heading line needs to become :1, and the last field in the heading also needs to be preceded by a :, so that everything before the colon (id and name together) serves as the join field. Then, using bash's process substitution, we can write:

$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2     - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$

The sed command (run three times) puts a : before the last field in each line of the files. The joins are very nearly symmetric. The -t: specifies that the field separator is the colon; the -a 1 and -a 2 mean that when a line in one file has no match in the other, the line is still included in the output; the -e 0 means that a field missing because of such a non-match is output as 0; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 outputs the join column (0), then the second column (the 1 flag) from each of the two files. The second join has 3 columns in its first input, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means 'read standard input'. The <(...) notation is 'process substitution', where a file name (usually /dev/fd/NN) is provided to the join command, and that file contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.
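
To see what each stage contributes, it can help to run the pipeline one step at a time. The following is only a sketch under stated assumptions: it recreates file1 and file2 from the question in a temporary directory, and the helper names mark_last_field and first_join are hypothetical, introduced here for illustration rather than taken from the answer. LC_ALL=C is also an assumption, set so that join compares keys byte by byte.

```shell
#!/usr/bin/env bash
# Sketch: inspect the intermediate stages of the pipeline.
# Assumes bash (for process substitution) and the sample data from the question.

export LC_ALL=C                      # assumption: bytewise key comparison in join

workdir=$(mktemp -d)                 # scratch directory for the sample files
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1' > "$workdir/file1"
printf '%s\n' 'id name in2' '2 sue 1' '3 bob 1' > "$workdir/file2"

# Hypothetical helper: replace the blank before the last field with a colon,
# so that "id name" becomes join field 1 and the flag becomes field 2.
mark_last_field() {
    sed 's/[ ]\([^ ]*\)$/:\1/' "$1"
}

# Hypothetical helper: full outer join of two marked files.
# -a 1 -a 2 keeps unmatched lines from both sides; -e 0 fills the missing flag.
first_join() {
    mark_last_field "$1" |
    join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 - <(mark_last_field "$2")
}

echo '--- file1 after sed ---'
mark_last_field "$workdir/file1"

echo '--- after the first join ---'
first_join "$workdir/file1" "$workdir/file2"
```

With the sample data, the sed stage should print id name:in1, 1 jon:1 and 2 sue:1, and the first join should print id name:in1:in2, 1 jon:1:0, 2 sue:1:1 and 3 bob:0:1. That three-column intermediate result is why the second join needs 1.2 and 1.3 in its -o list.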

The only difference from the desired output is the sequencing of 3 bob after 3 adam; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as sort -k1,1 -k3,5, except that sorts the label line after the data; there are workarounds for that if necessary).
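
One possible workaround for the label line, sketched here as an assumption rather than as the method the answer had in mind, is to read the header separately and sort only the remaining lines. The function name sort_keep_header is hypothetical, and the sort keys are a guess at the intended order: ascending numeric id, then the three flag columns in descending order (the r modifier), which happens to put 3 bob ahead of 3 adam as in the question.

```shell
#!/usr/bin/env bash
# Sketch: reorder the data rows while keeping the header line first.

# Hypothetical helper: pass the first input line through unchanged,
# then sort the rest by id (numeric) and by the flag columns (descending).
sort_keep_header() {
    IFS= read -r header              # consume only the header line
    printf '%s\n' "$header"
    sort -k1,1n -k3,5r               # assumption: this is the desired ordering
}

# Example input: the merged output produced by the join pipeline.
printf '%s\n' \
    'id name in1 in2 in3' \
    '1 jon 1 0 0' \
    '2 sue 1 1 1' \
    '3 adam 0 0 1' \
    '3 bob 0 1 0' |
sort_keep_header
```

In practice the function would simply be appended to the pipeline after the final sed. Note that the keys assume the name is a single word; a name containing blanks would shift the flag columns.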
