How can I count the number of lines after a string match until the next specific match occurs?


Question

I have a file with the following structure (see below). I need help finding a way to match every ">Cluster" string and, for each match, count the number of lines until the next ">Cluster", and so on until the end of the file.

>Cluster 0
0       10565nt, >CL9602.Contig1_All... *
1       1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0       3798nt, >CL3196.Contig1_All... at +/97.63%
1       9084nt, >CL3196.Contig3_All... *
>Cluster 2
0       8710nt, >Unigene21841_All... *
>Cluster 3
0       8457nt, >Unigene10299_All... *

The desired output should look like this:

Cluster 0  2 
Cluster 1  2
Cluster 2  1
Cluster 3  1

I tried awk as shown below, but it only gives me the line numbers.

awk '{print FNR "\t" $0}' All-Unigene_Clustered.fa.clstr | head - 20
==> standard input <==
1       >Cluster 0
2       0       10565nt, >CL9602.Contig1_All... *
3       1       1331nt, >CL9602.Contig2_All... at -/98.05%
4       >Cluster 1
5       0       3798nt, >CL3196.Contig1_All... at +/97.63%
6       1       9084nt, >CL3196.Contig3_All... *
7       >Cluster 2
8       0       8710nt, >Unigene21841_All... *
9       >Cluster 3
10      0       8457nt, >Unigene10299_All... *

I also tried sed, but it only prints the lines, and it even omits some of them.

sed -n -e '/>Cluster/,/>Cluster/ p' All-Unigene_Clustered.fa.clstr | head             
>Cluster 0
0       10565nt, >CL9602.Contig1_All... *
1       1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
>Cluster 2
0       8710nt, >Unigene21841_All... *
>Cluster 3
>Cluster 4
0       1518nt, >CL2313.Contig1_All... at -/95.13%
1       8323nt, >CL2313.Contig8_All... *

In addition, I tried awk and sed in combination with 'wc', but that only gives me the total number of occurrences of the matched string.

I thought of extracting the lines that do not match the string '>Cluster' using grep's -v option, then extracting every line that does match '>Cluster', and adding both to a new file, e.g.

grep -vw '>Cluster' All-Unigene_Clustered.fa.clstr | head
0       10565nt, >CL9602.Contig1_All... *
1       1331nt, >CL9602.Contig2_All... at -/98.05%
0       3798nt, >CL3196.Contig1_All... at +/97.63%
1       9084nt, >CL3196.Contig3_All... *
0       8710nt, >Unigene21841_All... *
0       8457nt, >Unigene10299_All... *
0       1518nt, >CL2313.Contig1_All... at -/95.13%

grep -w '>Cluster' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
>Cluster 1
>Cluster 2
>Cluster 3
>Cluster 4

but the problem is that the number of lines following each '>Cluster' isn't constant: each '>Cluster' string is followed by 1, 2, 3 or more lines until the next one occurs.

I decided to post my question after searching extensively through previously answered questions without finding a helpful answer.

Thanks

Recommended answer

Could you please try the following.

awk '
/^>Cluster/{
  if(count){
    print prev,count
  }
  sub(/^>/,"")
  prev=$0
  count=""
  next
}
{
  count++
}
END{
  if(count && prev){
    print prev,count
  }
}
' Input_file

Explanation: here is the same code with a comment on each line.

awk '                      ##Start of the awk program.
/^>Cluster/{               ##If the line starts with >Cluster, do the following.
  if(count){               ##If count is non-zero (the previous cluster had lines), do the following.
    print prev,count       ##Print the previous cluster name and its line count.
  }                        ##End of the if block.
  sub(/^>/,"")             ##Remove the leading > from the current line.
  prev=$0                  ##Store the current line (the cluster name) in prev.
  count=""                 ##Reset count for the new cluster.
  next                     ##Skip the remaining rules and read the next line.
}                          ##End of the >Cluster block.
{
  count++                  ##Increment count for every line that is not a cluster header.
}
END{                       ##END block, run after the last line has been read.
  if(count && prev){       ##If both count and prev are set, do the following.
    print prev,count       ##Print the last cluster name and its line count.
  }                        ##End of the if block.
}                          ##End of the END block.
' Input_file               ##Name of the input file.

The output will be as follows.

Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
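
For reference, the same counting idea can be sketched a little more compactly. This is an illustrative variant and not part of the original answer: the function name count_cluster_members and the temporary sample file are made up for the example. Unlike the code above, it also prints clusters that have no member lines (with a count of 0), which may or may not be what you want.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption for this sketch):
# prints "<cluster header> <number of member lines>" for each cluster.
count_cluster_members() {
  awk '
    # A header line starts a new cluster: flush the previous one first.
    /^>Cluster/ {
      if (seen) print name, count
      name = substr($0, 2)   # drop the leading ">"
      count = 0
      seen = 1
      next
    }
    # Any other line after the first header belongs to the current cluster.
    seen { count++ }
    # Flush the last cluster at end of input.
    END { if (seen) print name, count }
  ' "$@"
}

# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>Cluster 0
0       10565nt, >CL9602.Contig1_All... *
1       1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0       3798nt, >CL3196.Contig1_All... at +/97.63%
1       9084nt, >CL3196.Contig3_All... *
>Cluster 2
0       8710nt, >Unigene21841_All... *
>Cluster 3
0       8457nt, >Unigene10299_All... *
EOF

count_cluster_members "$sample"
rm -f "$sample"
```

On the sample data this should print the same four lines as above. Resetting count to 0 rather than to an empty string is what lets an empty cluster show up as "Cluster N 0" instead of being skipped.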
