How to get the output from the comm command into 3 separate files?


Problem description

The question "Unix command to find lines common in two files" has an answer suggesting the use of the comm command to do the task:

comm -12 1.sorted.txt 2.sorted.txt

This shows the lines common to the two files (the -1 suppresses the lines that are only in the first file, and the -2 suppresses the lines that are only in the second file, leaving just the lines common to both files as output). As the file names suggest, the input files must be in sorted order.
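As a minimal sketch of that requirement, the snippet below sorts two small files before comparing them. The file names a.txt and b.txt and their contents are hypothetical, invented only for this illustration.

```shell
# Hypothetical unsorted inputs, created only for this illustration.
printf '%s\n' pear apple fig > a.txt
printf '%s\n' fig banana apple > b.txt

# comm expects sorted input, so sort each file first.
sort a.txt > a.sorted.txt
sort b.txt > b.sorted.txt

# -1 and -2 suppress the "only in file 1" and "only in file 2" columns,
# so only the lines present in both files are printed.
comm -12 a.sorted.txt b.sorted.txt
```

With these inputs, the only lines printed are the two that occur in both files, apple and fig.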

In a comment to that question, bapors asks:

How would one have the outputs in different files?

To clarify, I asked:

If you want the lines only in File1 in one file, those only in File2 in another, and those in both in a third, then (provided that none of the lines in the files starts with a tab) you could use sed to split the output to three files.

User bapors confirmed:

It is exactly what I was asking. Would you show an example?

The answer is relatively long-winded and would spoil the simplicity of the answer to the other question (drowning it out with lots of information), so I've asked the question separately here, and provided an answer too.

Recommended answer

The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs the lines found only in the second file with a single leading tab; and it outputs the lines found in both files with two leading tabs.

It also relies on sed's w command to write to files.

Given the file 1.sorted.txt containing:

1.line-1
1.line-2
1.line-4
1.line-6
2.line-2
3.line-5

and the file 2.sorted.txt containing:

1.line-3
2.line-1
2.line-2
2.line-4
2.line-6
3.line-5

the basic output from comm 1.sorted.txt 2.sorted.txt is:

1.line-1
1.line-2
        1.line-3
1.line-4
1.line-6
        2.line-1
                2.line-2
        2.line-4
        2.line-6
                3.line-5
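The tab prefixes are hard to see in a plain listing like this one. As a small sketch, sed's l command (used with -n) prints each line unambiguously, showing every tab as \t and marking the end of each line with $. The printf lines only recreate the two sample files so that the snippet runs on its own.

```shell
# Recreate the two sample files from the text.
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# Show the comm output with tabs made visible as \t and line ends as $.
comm 1.sorted.txt 2.sorted.txt | sed -n l
```

Lines unique to the second file then appear with one \t in front, and lines common to both files with two.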

Given a file script.sed containing:

/^\t\t/ {
    s///
    w file.3
    d
}
/^\t/ {
    s///
    w file.2
    d
}
/^[^\t]/ {
    w file.1
    d
}

you can run the command shown below and get the desired output like this:

$ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
$ cat file.1
1.line-1
1.line-2
1.line-4
1.line-6
$ cat file.2
1.line-3
2.line-1
2.line-4
2.line-6
$ cat file.3
2.line-2
3.line-5
$

The script works by:

  1. matching lines that start with 2 tabs, deleting the tabs, writing the line to file.3, and deleting the line (so the rest of the script is ignored),
  2. matching lines that start with 1 tab, deleting the tab, writing the line to file.2, and deleting the line (so the rest of the script is ignored),
  3. matching lines that do not start with a tab, writing the line to file.1, and deleting the line.

The match and delete operations in step 3 are more for symmetry than anything else; they could be omitted (leaving just w file.1) and this script would work the same. However, see script3.sed below for further justification for keeping the symmetry.

As written, that requires GNU sed; BSD sed doesn't recognize the \t escapes. The file could instead be written with actual tab characters in place of the \t notation, and then BSD sed is OK with the script.
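One way to get literal tabs into the script without typing them is to let the shell supply them. The sketch below is an assumption about how that could be done, not part of the original answer: it stores a tab in a shell variable (the name tab is arbitrary) and expands it inside a double-quoted sed script, so sed never sees a \t escape. The printf lines recreate the sample files so that the snippet is self-contained.

```shell
# Recreate the two sample files from the text.
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# A literal tab character; the variable name is only illustrative.
tab=$(printf '\t')

# Same logic as script.sed, but the shell expands $tab into real tabs
# before sed parses the script.
comm 1.sorted.txt 2.sorted.txt |
sed "/^$tab$tab/ {
    s///
    w file.3
    d
}
/^$tab/ {
    s///
    w file.2
    d
}
/^[^$tab]/ {
    w file.1
    d
}"
```

Because the script is in double quotes, any other $, backquote or backslash added to it would also be interpreted by the shell, so this form suits short scripts like this one best.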

It is possible to make it work all on the command line, but it is fiddly (and that's being polite about it). Using Bash's ANSI C Quoting, you can write:

$ comm 1.sorted.txt 2.sorted.txt |
> sed -e $'/^\t\t/  { s///\n w file.3\n d\n }' \
>     -e $'/^\t/    { s///\n w file.2\n d\n }' \
>     -e $'/^[^\t]/ {        w file.1\n d\n }'
$

which writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy; it expects the file name, and only the file name, after it on the same line of the script, hence the use of \n after the file names in the script. There are spaces aplenty that could be eliminated, but the symmetry is clearer with the layout shown. Using the -f script.sed file is probably simpler, and it is certainly a technique worth knowing, as it avoids problems when the sed script must operate on single quotes, double quotes and back-quotes, which make it difficult to write the script on the Bash command line.

Finally, if the two files can contain lines starting with tabs, this technique requires some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.

script3.sed (displayed here with each tab expanded to spaces, up to the next 8-column tab stop; the actual file must contain literal tab characters before the X). Note that this time a substitute s/// is needed in the third paragraph (the d is still optional, but may as well be included):

/^              X/ {
    s///
    w file.3
    d
}
/^      X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}

And the command line:

$ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
> sed -f script3.sed
$

For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code doesn't change the sort order of the data and would handle leading tabs if they were present.
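If reading the inputs three times is acceptable, a simpler sketch avoids parsing the tab prefixes at all. It is an alternative to the single-pass sed approach, not the method described above: each comm invocation suppresses two of the three columns, and a lone remaining column is printed without any tab prefix, so leading tabs in the data cause no ambiguity. The printf lines recreate the sample files so that the snippet is self-contained.

```shell
# Recreate the two sample files from the text.
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# One pass per output file; each pass keeps exactly one column.
comm -23 1.sorted.txt 2.sorted.txt > file.1   # lines only in the first file
comm -13 1.sorted.txt 2.sorted.txt > file.2   # lines only in the second file
comm -12 1.sorted.txt 2.sorted.txt > file.3   # lines in both files
```

The trade-off is three passes over both inputs instead of one, which matters only for large files.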

You can also easily write solutions that use Perl or Awk, and those do not even have to use comm (and can be made to work with unsorted files, provided the files fit into memory).
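A minimal sketch of such an Awk solution follows. It is one possible shape, not the author's code, and it makes some assumptions: the first file fits in memory and is not empty, and repeated lines are treated as a set rather than counted the way comm counts them. The array names first, both and order are arbitrary. The printf lines recreate the sample files so that the snippet is self-contained.

```shell
# Recreate the two sample files from the text.
printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

awk '
    # While reading the first file: remember each distinct line and its order.
    FNR == NR {
        if (!($0 in first)) order[++n] = $0
        first[$0] = 1
        next
    }
    # While reading the second file: a line seen in the first file is common.
    ($0 in first) { both[$0] = 1; print > "file.3"; next }
    # Otherwise the line occurs only in the second file.
    { print > "file.2" }
    # Finally, lines of the first file never seen in the second file.
    END {
        for (i = 1; i <= n; i++)
            if (!(order[i] in both)) print order[i] > "file.1"
    }
' 1.sorted.txt 2.sorted.txt
```

Nothing here depends on the inputs being sorted; each output file keeps the order in which its lines were first read. An output file that would be empty is simply not created.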
