Optimising my script which does lookups in a big compressed file


Question

I'm here again! I would like to optimise my bash script in order to lower the time spent on each loop. Basically, what it does is:

  • Grab information from a tsv file
  • Use that information to look into a file with awk
  • Print and export the matching line

My issues are: 1) the files are 60 GB compressed files: I need software to decompress them (I'm actually trying to decompress one now, not sure I'll have enough space) 2) it takes a long time to look through them anyway

My ideas to improve it:

  1. 0) as said, if possible I'll decompress the file
  2. using GNU parallel with parallel -j 0 ./extract_awk_reads_in_bam.sh ::: reads_id_and_pos.tsv, but I'm unsure it works as expected: it cuts the time per search from 36 min down to 16 min, so only a factor of about 2.25? (and I have 16 cores)

I was also thinking (but it may be redundant with GNU parallel?) of splitting my list of info to look up into several files, so I can launch them in parallel.
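
For the splitting idea, something like the following might do it (the chunk_ prefix and the 16-way split are just example choices; note that ::: with a single filename starts only one job, since GNU parallel parallelises over the arguments it is given):

# split the tsv into 16 chunks without breaking lines in the middle (GNU split)
split -n l/16 reads_id_and_pos.tsv chunk_
# run the existing per-file script once per chunk, up to 16 jobs at a time
parallel -j 16 ./extract_awk_reads_in_bam.sh ::: chunk_*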

Here is the rest of my bash script. I'm really open to ideas for improving it, but I'm not sure I'm a superstar in programming, so maybe keeping it simple would help? :)

My bash script:

#!/bin/bash
# one tab-separated (read id, hotspot position) pair per line of the input tsv
while IFS=$'\t' read -r READ_ID_WH POS_HOTSPOT; do
    # log progress to a file and to the terminal
    echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}" >> /data/bismark2/reads_done_so_far.txt
    echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}"
    # scan the whole BAM for this read id; keep at most 2 matches and append them to the export file
    samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" '$1==read_id {printf $0 "\t%s\twh_genome\n",pos_hotspot}' | head -2 >> /data/bismark2/export_reads_mapped.tsv
done <"$1"

My tsv file has a format like:

READ_ABCDEF\t1200

Thank you very much ++

Answer

TL;DR

Your new script would be:

#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
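
A possible invocation, assuming the script above is saved as extract_reads.sh (a hypothetical name) and that the annotated reads should go to the same export file as in the original script:

# pass the tsv of read ids and positions as the first argument
./extract_reads.sh reads_id_and_pos.tsv >> /data/bismark2/export_reads_mapped.tsv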


You are reading the entire file for each of the inputs. It is better to look for all of them at the same time: start by extracting the interesting reads and then, on this subset, apply the second transformation.

samtools view -@ 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"

Here you are getting all the reads with samtools and searching for all the terms that appear in the -f parameter of grep. That parameter is a file containing the first column of the search input file. The output is a sam file with only the reads that are listed in the search input file.
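
The $bam and $sam variables are not defined in the snippets; for this sketch they could be set to something like the following (the BAM path comes from the question, the sam path is just a hypothetical name for the intermediate subset):

# BAM file from the question, and a hypothetical file holding the extracted subset
bam=/data/bismark2/aligned_on_nDNA/bamfile.bam
sam=/data/bismark2/interesting_reads.sam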

awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam"

Finally, use awk to add the extra information:

  1. Open the search input file at the start of the awk program and read its contents into an array (st_array)
  2. Set the output field separator to a tab
  3. Go through the sam file and add the extra information from the pre-populated array.
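
For readability, here is a commented, multi-line version of the same awk program (same logic; the only addition is an explicit end-of-file check on getline, which is the idiomatic safe form):

awk -v st="$1" '
BEGIN {
    OFS = "\t"                          # separate output fields with tabs
    # preload the search input file: read id -> hotspot position
    while ((getline < st) > 0) {
        st_array[$1] = $2
    }
}
{
    # append the hotspot position and a label to every sam line
    print $0, st_array[$1], "wh_genome"
}' "$sam"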

I'm proposing this scheme because I feel like grep is faster than awk for doing the search, but the same result can be obtained with awk alone:

samtools view -@ 2 "$bam" | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'


In this case, you only need to add a conditional to identify the interesting reads and get rid of the grep.

In any case, you don't need to re-read the file more than once or to decompress it before working with it.
