Merge two files by one column

Question

I have two files:

$ wc -l new_bacteria.txt 
28633861 new_bacteria.txt

$ wc -l allin1_trinity_bacteria_blastx.tsv 
4352 allin1_trinity_bacteria_blastx.tsv

$ head new_bacteria.txt
gi|406035365|ref|ZP_11042729.1| Acinetobacter parvus
gi|406035366|ref|ZP_11042730.1| Acinetobacter parvus
gi|406035367|ref|ZP_11042731.1| Acinetobacter parvus
gi|406035368|ref|ZP_11042732.1| Acinetobacter parvus
gi|406035369|ref|ZP_11042733.1| Acinetobacter parvus
gi|406035370|ref|ZP_11042734.1| Acinetobacter parvus
gi|406035371|ref|ZP_11042735.1| Acinetobacter parvus
gi|406035372|ref|ZP_11042736.1| Acinetobacter parvus
gi|406035373|ref|ZP_11042737.1| Acinetobacter parvus
gi|406035374|ref|ZP_11042738.1| Acinetobacter parvus

$ head allin1_trinity_bacteria_blastx.tsv

c91_g1_i1   gi|46447089|ref|YP_008454.1|    39.60   101 59  1   306 4   1676    1774    6e-11   68.2
c146_g1_i1  gi|357399595|ref|YP_004911520.1|    39.53   86  47  2   246 4   49  134 5e-06   52.0
c202_g1_i1  gi|508605652|ref|YP_006991274.2|    62.16   37  14  0   154 44  49  85  3e-06   45.4
c202_g1_i1  gi|508605652|ref|YP_006991274.2|    63.16   19  7   0   201 145 33  51  3e-06   27.7
c202_g1_i1  gi|508605652|ref|YP_006991274.2|    76.92   13  3   0   242 204 20  32  3e-06   21.6
c224_g1_i1  gi|395217261|ref|ZP_10401556.1| 72.62   84  23  0   260 9   274 357 6e-38     144
c230_g1_i1  gi|261381445|ref|ZP_05986018.1| 57.50   40  17  0   248 129 57  96  2e-09   45.8
c230_g1_i1  gi|261381445|ref|ZP_05986018.1| 50.00   42  19  1   120 1   101 142 2e-09   41.2
c294_g1_i1  gi|298242911|ref|ZP_06966718.1| 37.33   75  46  1   14  238 814 887 3e-07   56.2
c304_g1_i1  gi|296393792|ref|YP_003658676.1|    42.86   56  32  0   56  223 17  72  6e-06   51.2

I want to merge these two files on the second column of allin1_trinity_bacteria_blastx.tsv. The output should have the same number of lines as the tsv file, since the other file is really big.

This would be an easy job in R, but my annotation file (new_bacteria.txt) is really big, so I am thinking about using unix merge. How can I make the output contain only the rows of the tsv file, and not all the lines of the new_bacteria.txt file?

Thanks!

Recommended answer

"I am thinking about using unix merge. But how can I make the output only contains the columns I want in the tsv file, but not all the lines in the new_bacteria.txt file?"

There is indeed a program named merge, but although it shares its name with R's merge() function, its purpose (combining separate changes to an original file) is not what you need; use join instead. Note that both files must be sorted on the join field. The example script sorts both files before joining. If new_bacteria.txt is already sorted, you can use it in place of sorted.txt; and if you want to run several joins against allin1_trinity_bacteria_blastx.tsv, it may be worth sorting it once and reusing sorted.tsv.

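# join needs both inputs sorted on the join field only
sort -k 2b,2 allin1_trinity_bacteria_blastx.tsv >sorted.tsv
sort -k 1b,1 new_bacteria.txt >sorted.txt
# match field 2 of sorted.tsv against field 1 of sorted.txt
join -1 2 sorted.tsv sorted.txt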
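
To see how this behaves on a small scale, here is a minimal sketch that runs the same sort-then-join steps on two tiny made-up files. The file names hits.tsv and names.txt and their contents are hypothetical stand-ins, not the asker's data. The sketch also makes two assumptions that go beyond the answer: it sets LC_ALL=C so that sort and join agree on collation order, and it adds join's -a 1 option, which also prints lines of the first file that have no partner in the second.

```shell
#!/bin/sh
# Hypothetical miniature inputs, for illustration only.
export LC_ALL=C   # assumption: force sort and join to use the same collation

workdir=$(mktemp -d)

# Stand-in for the blast table: the join key is in column 2.
printf 'c1\tgi|2|ref|B|\t39.60\nc2\tgi|9|ref|Z|\t62.16\nc3\tgi|1|ref|A|\t57.50\n' >"$workdir/hits.tsv"

# Stand-in for the annotation file: the join key is in column 1.
printf 'gi|1|ref|A| Acinetobacter parvus\ngi|2|ref|B| Escherichia coli\ngi|3|ref|C| Bacillus subtilis\n' >"$workdir/names.txt"

# Sort each file on its join field only, ignoring leading blanks.
sort -k 2b,2 "$workdir/hits.tsv"  >"$workdir/hits.sorted"
sort -k 1b,1 "$workdir/names.txt" >"$workdir/names.sorted"

# -1 2 : use field 2 of the first file (field 1 of the second is the default).
# -a 1 : also print lines of the first file that found no match.
result=$(join -1 2 -a 1 "$workdir/hits.sorted" "$workdir/names.sorted")
printf '%s\n' "$result"

rm -rf "$workdir"
```

In this sketch the hit for gi|9|ref|Z| has no annotation and is still printed (without a species name), while the unused annotation line gi|3|ref|C| is dropped. The output therefore has one line per line of the tsv file, provided each ID occurs only once in the annotation file.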
