Why is an awk script faster than a C++ program?


Problem description

Background: I'm a rookie in C++.

Input file: 1 million lines like this one:

FCC5G2YACXX:5:1101:1224:2059#NNNNNNNN   97  genome  96003934    24  118M4D11M   =   96004135    0   GCA....ACG  P\..GW^EO   AS:i:-28    XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:54G53T9^TACA11 YT:Z:UP

Expected output

96003934 98.31

Explanation of the output

Column 4: 96003934

Column 18: MD:Z:54G53T9^TACA11

match = 54+53+9 = 116

mismatch = count_letter(54G53T9) = 2

id = 116*100 / (116+2) = 98.30508474576272
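
The arithmetic above can be sketched as a small C++ function. This is only an illustration of the rule as described, not the program from the question: the name `md_identity` is made up here, and the sketch assumes that everything after the first `^` is ignored and that every letter before it counts as one mismatch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: percent identity from an MD field such as
// "MD:Z:54G53T9^TACA11", following the rule described above.
double md_identity(const std::string& md_field) {
    // Keep only the text after the last ':' and before the first '^'.
    std::size_t start = md_field.find_last_of(':');
    start = (start == std::string::npos) ? 0 : start + 1;
    std::size_t stop = md_field.find('^', start);
    if (stop == std::string::npos) stop = md_field.size();

    long match = 0;     // sum of all digit runs
    long mismatch = 0;  // number of letters
    long run = 0;       // digit run currently being read
    for (std::size_t i = start; i < stop; ++i) {
        unsigned char c = static_cast<unsigned char>(md_field[i]);
        if (std::isdigit(c)) {
            run = run * 10 + (c - '0');
        } else {
            match += run;  // a non-digit ends the current run
            run = 0;
            if (std::isalpha(c)) ++mismatch;
        }
    }
    match += run;  // flush the final run

    if (match + mismatch == 0) return 0.0;  // nothing to score
    return 100.0 * match / (match + mismatch);
}
```

Calling `md_identity("MD:Z:54G53T9^TACA11")` gives match = 116 and mismatch = 2, which is the 98.31 shown in the expected output once rounded to two decimals.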

awk script

awk '{
    split($18,v,/[\^:]/); 
    nmatch = split(v[3],vmatch, /[^0-9]/); 
    cmatch=0; 
    for(i=1; i<=nmatch; i++) cmatch+=vmatch[i]; 
    printf("%s"OFS"%.2f\n", $4, cmatch*100/(cmatch+nmatch-1));
}' file.sam

C++ program, which I thought would be faster

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <iomanip>

using namespace std;

int main(){
  string line;
  while(getline(cin, line)){
    istringstream iss(line);
    vector<string> columns;
    copy(istream_iterator<string>(iss),    //Split line by spaces
         istream_iterator<string>(),
         back_inserter(columns));
    //I extract information from column 18
    int start = columns[17].find_last_of(':');
    int end = columns[17].find_first_of('^');
    string smatch = columns[17].substr(start+1, end-start-1);
    // I get for example "54G53T9"
    replace( smatch.begin(), smatch.end(), 'A', ' ');
    replace( smatch.begin(), smatch.end(), 'C', ' ');
    replace( smatch.begin(), smatch.end(), 'G', ' ');
    replace( smatch.begin(), smatch.end(), 'T', ' ');
    // I get for example "54 53 9"
    istringstream iss_sum(smatch);
    int n=0, sum=0, count=0;
    while(iss_sum >> n){
      sum += n;
      count++;
    }
    cout << columns[3] << ' ' << fixed << setprecision(2) 
         << (float)sum*100 / (sum+count-1) << endl;
  }
}

Benchmark

With 1 million lines of input:

  • awk: 0m6.102s
  • C++: 0m15.814s

Question

What am I doing wrong that makes the C++ program so slow? Can I improve it, and if so, how? Or should I write it in C instead?

Thanks in advance.

Solution

C++ iostreams don't really provide a good way of checking that a column exists in some input, but otherwise ignoring it. C++ iostreams have an ignore, but it doesn't fit this particular case very well, so it probably won't help.
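
To illustrate the point, here is a minimal sketch of how skipping columns usually looks with iostreams. The helper name `nth_column` is hypothetical. It assumes whitespace-separated columns, and it has to extract every skipped column into a throwaway string, because `ignore` works on character counts and a single delimiter rather than on whitespace-separated fields.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the n-th (1-based) whitespace-separated
// column of `line`, or an empty string if the line has fewer columns.
std::string nth_column(const std::string& line, int n) {
    std::istringstream iss(line);
    std::string field;
    for (int i = 0; i < n; ++i) {
        // Each skipped column is still extracted and copied into `field`.
        if (!(iss >> field)) return std::string();
    }
    return field;
}
```

For the sample line in the question, `nth_column(line, 4)` would return the position and `nth_column(line, 18)` the `MD:Z:` field, at the cost of copying every column in between.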

That being the case, I'd at least consider using scanf instead, possibly something on this general order:

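#include <stdio.h>
#include <string.h>
#include <numeric>

int main() {
    char column4[256];
    char column17[256];   // holds the MD:Z: field (column 18 when counting from 1)

    // Skip columns 1-3, read column 4, skip columns 5-17, read column 18, skip column 19.
    while (2 == scanf("%*s %*s %*s %255s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %255s %*s", column4, column17)) {
        // Start just after the last colon; fall back to the whole field if there is none.
        char *colon = strrchr(column17, ':');
        char *beg = colon ? colon + 1 : column17;
        char *end = strchr(column17, '^');

        // Cut the field at the first '^', if there is one.
        if (end)
            *end = '\0';

        int nums[5] = {0};

        // Read up to five numbers separated by runs of capital letters.
        int count = sscanf(beg, "%d%*[A-Z]%d%*[A-Z]%d%*[A-Z]%d%*[A-Z]%d", nums, nums + 1, nums + 2, nums + 3, nums + 4);
        if (count < 1)
            continue;     // no number found, so skip this line

        int sum = std::accumulate(nums, nums + count, 0);

        // count numbers are separated by count-1 letters, i.e. count-1 mismatches.
        double result = (sum*100.0) / (sum + count-1);
        printf("%s %2.2f\n", column4, result);
    }
}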

For the moment, this assumes (perhaps incorrectly, but I had to guess at something) that the 17th column (I count it as the 18th, but whatever) can be ignored from the beginning up to the last colon (:). After that comes some arbitrary number of repetitions of a number, then a letter, another number, another letter, and so on (presumed for the moment to start and end with a number). For the moment, I've allowed for up to 5 numbers, but allowing more would be trivial. Allowing for more variation in the pattern might take a little more work, depending on what sort of variation can happen.
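
One way to lift the five-number limit is to walk the string and let `strtol` consume one number at a time. The following is only a sketch under the same assumption about the pattern (digit runs separated by letters); the helper name `sum_numbers` is hypothetical and not part of the code above.

```cpp
#include <cctype>
#include <cstdlib>

// Hypothetical helper: sums every number in a string like "54G53T9"
// and reports how many numbers were found through *count.
long sum_numbers(const char* s, int* count) {
    long sum = 0;
    *count = 0;
    while (*s != '\0') {
        if (std::isdigit(static_cast<unsigned char>(*s))) {
            char* next = nullptr;
            sum += std::strtol(s, &next, 10);  // reads one whole digit run
            ++*count;
            s = next;  // continue right after the digits
        } else {
            ++s;       // skip a separator character
        }
    }
    return sum;
}
```

With this helper, the `sscanf` and `accumulate` calls could be replaced by `long sum = sum_numbers(beg, &count);`, and the formula `(sum*100.0) / (sum + count-1)` stays the same.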

For a little more speed improvement, you could use a somewhat larger input buffer, something like this:

setvbuf(stdin, NULL, _IOFBF, 65536);

You need/want to do this before reading anything, so it would go before the while loop. Exactly how much good this will do (if any) seems to vary, but it's easy enough to do that it's worth at least trying it to see if it makes any difference.
