Bash script to concatenate text files with specific substrings in filenames


Question

Within a certain directory I have many directories, each containing a bunch of text files. I'm trying to write a script that, for each directory, concatenates only the files with the string 'R1' in their filename into one file within that directory, and those with 'R2' into another. This is what I wrote, but it's not working.

#!/bin/bash

for f in */*.fastq; do

    if grep 'R1' $f ; then
        cat "$f" >> R1.fastq
    fi

    if grep 'R2' $f ; then
        cat "$f" >> R2.fastq
    fi

done

I get no errors and the files are created as intended, but they are empty. Can anyone tell me what I'm doing wrong?

Thank you all for the fast and detailed responses! I think I wasn't very clear in my question: I need the script to concatenate only the files within each specific directory, so that each directory gets its own new files (R1 and R2). I tried doing

cat /*R1*.fastq >*/R1.fastq 

but it gave me an ambiguous redirect error. I also tried Charles Duffy's for loop, but looping through the directories with a nested loop to run through each file within a directory, like so

for f in */; do
   for d in "$f"/*.fastq;do
     case "$d" in
       *R1*) cat "$d" >&3
       *R2*) cat "$d" >&4
     esac
   done 3>R1.fastq 4>R2.fastq
done

but it was giving an unexpected token error regarding ')'.

Sorry in advance if I'm missing something elementary; I'm still very new to bash.

Recommended answer

A Note To The Reader

Please review the edit history on the question when considering this answer; several parts have been made less relevant by question edits.

For the purpose at hand, you can probably just let shell globbing do all the work (if R1 or R2 will be in the filenames, as opposed to the directory names):

set -x # log what's happening!
cat */*R1*.fastq >R1.fastq
cat */*R2*.fastq >R2.fastq
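
One thing worth checking with the glob version is what happens when a pattern matches nothing: by default bash hands the unexpanded pattern to cat, which fails, while the redirection still creates an empty output file. The following is only a sketch of one way to guard against that, run in a scratch directory; the names work, sampleA, sampleB, r1, r2 and the sample file names are made up for illustration.

```shell
#!/bin/bash
# Demo in a scratch directory; all directory and file names are hypothetical.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir sampleA sampleB
printf 'a1\n' > sampleA/x_R1_001.fastq
printf 'b1\n' > sampleB/y_R1_001.fastq

shopt -s nullglob            # unmatched globs expand to nothing
r1=( */*R1*.fastq )          # collect the matches in arrays first
r2=( */*R2*.fastq )
shopt -u nullglob

# Only run cat (and create the output file) when something matched.
if (( ${#r1[@]} )); then cat "${r1[@]}" > R1.fastq; fi
if (( ${#r2[@]} )); then cat "${r2[@]}" > R2.fastq; fi
```

In this demo there are no R2 inputs, so R2.fastq is never created instead of being left behind as an empty file.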


One find Per Output File

If it's a really large number of files, by contrast, you might need find:

find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' -exec cat '{}' + >R1.fastq
find . -mindepth 2 -maxdepth 2 -type f -name '*R2*.fastq' -exec cat '{}' + >R2.fastq

This is because of the OS-dependent limit on command-line length: the find command given above puts as many arguments onto each cat command as possible for efficiency, but still splits them across multiple invocations where the limit would otherwise be exceeded.
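
To see the limit and the batching for yourself, the sketch below prints the system's limit with getconf ARG_MAX and then substitutes a tiny sh -c wrapper for cat, so that each invocation reports how many file names it received. The scratch directory, the file names and the batches variable are assumptions made for the demo.

```shell
#!/bin/bash
# Demo in a scratch directory; all directory and file names are hypothetical.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir d1
for i in 1 2 3; do : > "d1/s${i}_R1.fastq"; done
: > d1/s1_R2.fastq

# Upper bound, in bytes, on the arguments plus environment of one exec call.
getconf ARG_MAX

# With '+', find appends as many paths as fit onto a single command line.
# Inside the wrapper, "$#" is the number of paths that invocation received.
batches=$(find . -mindepth 2 -maxdepth 2 -type f -name '*R1*.fastq' \
  -exec sh -c 'echo "invocation got $# files"' sh '{}' +)
echo "$batches"
```

With only three short paths everything fits into one invocation; with enough files to exceed the limit, the wrapper would print one line per batch.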

If you really do want to iterate over everything and then test the names, consider a case statement for the job, which is much more efficient than invoking grep to check just one line. (Note also that grep 'R1' $f, as written in the question, searches the contents of the file rather than its name.)

for f in */*.fastq; do
  case $f in
    *R1*) cat "$f" >&3 ;;   # each case branch must end with ;;
    *R2*) cat "$f" >&4 ;;
  esac
done 3>R1.fastq 4>R2.fastq

Note the use of file descriptors 3 and 4 to write to R1.fastq and R2.fastq respectively. That way the output files are opened only once (and thus truncated exactly once), when the for loop starts, and those file descriptors are reused rather than the output files being reopened at the beginning of each cat. (That said, running cat once per file, which find -exec {} + avoids, probably costs more overhead on balance.)
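
The difference between reopening the output on every iteration and opening it once on the loop can be seen in a small experiment. This is only an illustrative sketch; a.txt, b.txt, truncated.out and once.out are made-up names.

```shell
#!/bin/bash
# Demo in a scratch directory; all file names are hypothetical.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'one\n' > a.txt
printf 'two\n' > b.txt

# '>' inside the loop reopens (and truncates) the output on every
# iteration, so only the last input survives.
for f in a.txt b.txt; do
  cat "$f" > truncated.out
done

# Opening descriptor 3 on the loop itself truncates the file once;
# every cat then writes through the same open descriptor.
for f in a.txt b.txt; do
  cat "$f" >&3
done 3> once.out

cat truncated.out   # only the contents of b.txt
cat once.out        # contents of a.txt followed by b.txt
```

Using >> inside the loop, as in the question, avoids the truncation but never clears the output, so re-running the script appends the same data again.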

All of the above can be updated quite trivially to work on a per-directory basis. For example:

for d in */; do
  find "$d" -name R1.fastq -prune -o -name '*R1*.fastq' -exec cat '{}' + >"$d/R1.fastq"
  find "$d" -name R2.fastq -prune -o -name '*R2*.fastq' -exec cat '{}' + >"$d/R2.fastq"
done

There are only two significant changes:

  • We no longer specify -mindepth, which previously ensured that our input files came only from subdirectories.
  • We exclude R1.fastq and R2.fastq from our input files, so we never try to use the same file as both input and output. This is a consequence of the prior change: previously, our output files couldn't be considered as input because they didn't meet the minimum depth.
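
The case-based loop can be adapted per directory in the same spirit. What follows is a sketch rather than a definitive version, run against a made-up layout (sampleA, sampleB and the sample file names are hypothetical): each case branch ends with ;;, the test is applied to the file name only, and the output files are skipped as inputs.

```shell
#!/bin/bash
# Demo in a scratch directory; all directory and file names are hypothetical.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir sampleA sampleB
printf 'A-r1-1\n' > sampleA/s_R1_001.fastq
printf 'A-r1-2\n' > sampleA/s_R1_002.fastq
printf 'A-r2-1\n' > sampleA/s_R2_001.fastq
printf 'B-r1-1\n' > sampleB/t_R1_001.fastq

for d in */; do                       # "$d" keeps its trailing slash
  for f in "$d"*.fastq; do
    case ${f##*/} in                  # test the file name, not the directory
      R1.fastq|R2.fastq) continue ;;  # never read our own output files
      *R1*) cat "$f" >&3 ;;
      *R2*) cat "$f" >&4 ;;
    esac
  done 3>"${d}R1.fastq" 4>"${d}R2.fastq"
done
```

Because both outputs are opened on the inner loop, a directory with no R2 inputs (sampleB here) ends up with an empty R2.fastq.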
