How to concatenate files that have the same beginning of a name?

Question

I have a directory with a few hundred *.fasta files, such as:

Bonobo_sp._str01_ABC784267_CDE789456.fasta
Homo_sapiens_cc21_ABC897867_CDE456789.fasta
Homo_sapiens_cc21_ABC893673_CDE753672.fasta 
Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta
Gorilla_gorilla_ghjk6789_ABC627190_CDE891345.fasta
Gorilla_gorilla_ghjk6789_ABC117190_CDE661345.fasta

I want to concatenate the files that belong to the same species, so in this case those for Homo_sapiens_cc21 and Gorilla_gorilla_ghjk6789.

Almost every species has a different number of files that I need to concatenate.

I know that I could use a simple loop in unix/linux like:

    for f in thesamename.fasta; do
        cat "$f" >> output.fasta
    done

But I don't know how to make the loop pick up only the files that share the same beginning. Doing this manually makes no sense with hundreds of files.

Does anybody have any idea how I could do that?

Recommended answer

I will assume that the naming logic is that the species is given by the first three words separated by underscores. I will also assume that there are no blank spaces in the file names.
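To make that naming assumption concrete, here is a minimal sketch of how the prefix can be pulled out of a single file name with cut. The function name species_prefix is hypothetical and only used for illustration; it simply keeps the first three underscore-separated fields.

```shell
# species_prefix is a hypothetical helper (not part of the original answer).
# It prints the first three underscore-separated fields of a file name.
species_prefix() {
    printf '%s\n' "$1" | cut -d_ -f1-3
}

species_prefix "Homo_sapiens_cc21_ABC897867_CDE456789.fasta"   # prints Homo_sapiens_cc21
species_prefix "Bonobo_sp._str01_ABC784267_CDE789456.fasta"    # prints Bonobo_sp._str01
```

If a species name ever used more or fewer than three fields, the field range passed to -f would have to change accordingly.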

A possible strategy is to get a list of all the species, and then concatenate all the files with a given species prefix into a single file:

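    # Build the unique list of species prefixes (first three underscore-separated
    # fields), then merge every file that starts with each prefix.
    for specie in $(ls *.fasta | cut -f1-3 -d_ | sort -u)
    do
        # The "_" after the prefix keeps the glob from matching an existing "$specie.fasta"
        cat "$specie"_*.fasta > "$specie.fasta"
    done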

This code lists all the .fasta files, cuts out the species ID, and generates a unique list of species. It then traverses that list and, for every species, concatenates all the files that start with that species ID into a single file named after the species.
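The same idea can also be sketched as a single pass over a glob, without parsing the output of ls. This is only an illustration under assumptions that are not in the original answer: the function name merge_by_prefix is hypothetical, it expects bash, and it writes the merged files into a separate output directory (different from the input directory) so that they can never be picked up as inputs.

```shell
# merge_by_prefix is a hypothetical helper (not part of the original answer).
# Usage: merge_by_prefix INPUT_DIR OUTPUT_DIR   (the two must be different)
merge_by_prefix() {
    local src=$1 out=$2 f name a b c rest prefix
    mkdir -p "$out"
    for f in "$src"/*.fasta; do
        [ -e "$f" ] || continue              # skip the literal pattern if nothing matched
        name=${f##*/}                        # strip the directory part
        # Split on "_" and keep the first three fields as the species prefix
        IFS=_ read -r a b c rest <<< "$name"
        prefix="${a}_${b}_${c}"
        cat "$f" >> "$out/$prefix.fasta"     # append to that species' merged file
    done
}

# Example call (directory names are placeholders):
# merge_by_prefix . merged
```

Because the function appends, running it twice against the same output directory would duplicate the sequences; emptying the output directory first avoids that.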

More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:

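    # NUL-delimited variant: find emits the names, cut strips "./" and keeps the prefix
    # (the -z options assume GNU cut and sort)
    while IFS= read -r -d '' specie
    do
        cat "$specie"_*.fasta > "$specie.fasta"
    done < <(find . -maxdepth 1 -name "*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)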
