为什么C ++中的字符串比Python慢​​? [英] Why is splitting a string slower in C++ than Python?

查看:364
本文介绍了为什么C ++中的字符串比Python慢​​?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图将一些代码从Python转换为C ++,以获得一点点的速度和锐化我生锈的C ++技能。昨天我很震惊,当从stdin阅读线的幼稚的实现在Python比C ++快得多(见)。今天,我终于弄明白了如何使用合并分隔符(类似语义到python的split())在C ++中分割一个字符串,现在我遇到了deja vu!我的C ++代码需要花费更长的时间来完成这项工作(虽然不是一个数量级,如昨天的课程)。



Python代码: / strong>

 #!/ usr / bin / env python 
from __future__ import print_function
import time
import sys

count = 0
start_time = time.time()
dummy = None

在sys.stdin中的行:
dummy = line.split()
count + = 1

delta_sec = int(time.time() - start_time)
print(Python:Saw { 0}行在{1}秒内。.format(count,delta_sec),end ='')
如果delta_sec> 0:
lps = int(count / delta_sec)
print(Crunch Speed:{0}。format(lps))
else:
print(' b $ b

C ++代码:

  #include< iostream> 
#include< string>
#include< sstream>
#include< time.h>
#include< vector>

using namespace std;

void split1(vector< string>& tokens,const string& str,
const string& delimiters =){
//跳过分隔符开头
string :: size_type lastPos = str.find_first_not_of(delimiters,0);

//查找第一个非分隔符
string :: size_type pos = str.find_first_of(delimiters,lastPos);

while(string :: npos!= pos || string :: npos!= lastPos){
//找到一个令牌,将其添加到向量
tokens.push_back (str.substr(lastPos,pos-lastPos));
//跳过分隔符
lastPos = str.find_first_not_of(delimiters,pos);
//查找下一个非分隔符
pos = str.find_first_of(delimiters,lastPos);
}
}

void split2(vector< string>& tokens,const string& str,char delim =''){
stringstream ss ); // convert string to stream
string item;
while(getline(ss,item,delim)){
tokens.push_back(item); //将令牌添加到向量
}
}

int main(){
string input_line;
vector< string>样条;
long count = 0;
int sec,lps;
time_t start = time(NULL);

cin.sync_with_stdio(false); // disable synchronous IO

while(cin){
getline(cin,input_line);
spline.clear(); //为下一行清空向量parse

//我正在尝试两个实现之一,每次编译,显然:
// split1(spline,input_line);
split2(spline,input_line);

count ++;
};

count--; //减去最终的重读
sec =(int)time(NULL) - start;
cerr<< C ++:Saw< count<< lines in<< sec << 秒。 ;
if(sec> 0){
lps = count / sec;
cerr<< 咬合速度:< lps< endl;
} else
cerr<< endl;
return 0;

//编译时使用:g ++ -Wall -O3 -o split1 split_1.cpp

请注意,我尝试了两种不同的分割实现。一个(split1)使用字符串方法来搜索令牌,并且能够合并多个令牌以及处理多个令牌(它来自这里)。第二个(split2)使用getline作为流读取字符串,不合并分隔符,并且只支持一个测试字符(一个是由几个StackOverflow用户在字符串分割问题的答案中发布的)。



我以各种顺序运行了多次。我的测试机是一个Macbook Pro(2011,8GB,四核),不是那么重要。我测试一个20M行的文本文件,三个空格分隔的列,每个看起来都类似于:foo.bar 127.0.0.1 home.foo.bar



结果:

  $ / usr / bin / time cat test_lines_double | ./split.py 
15.61实际0.01用户0.38 sys
Python:在15秒内看到20000000行。紧缩速度:1333333
$ / usr / bin / time cat test_lines_double | ./split1
23.50 real 0.01 user 0.46 sys
C ++:在23秒内看到20000000行。压缩速度:869565
$ / usr / bin / time cat test_lines_double | ./split2
44.69 real 0.02 user 0.62 sys
C ++:在45秒内搜索20000000行。咬合速度:444444

我做错了什么?有没有更好的方法来做C ++中的字符串拆分不依赖于外部库(即没有boost),支持合并序列的分隔符(如python的分割),是线程安全的(所以没有strtok),其性能至少



编辑1 /部分解决方案?:试图通过使用python重置虚拟列表并且每次都附加到它来做一个更公平的比较,如C ++。这仍然不是C ++代码正在做什么,但它有点接近。基本上,循环现在是:

 对于sys.stdin中的行:
dummy = []
dummy + = line.split()
count + = 1

python的性能现在和split1 C ++实现相同。

  / usr / bin / time cat test_lines_double | ./split5.py 
22.61实际0.01用户0.40 sys
Python:在22秒内搜索20000000行。紧缩速度:909090



我还是感到惊讶,即使Python是这样优化字符串处理Matt Joiner建议),这些C ++实现不会更快。如果任何人有关于如何使用C ++更好的方式这样做的想法,请共享您的代码。 (我想我的下一步将尝试在纯C中实现这一点,虽然我不打算折衷程序员的生产力重新实现我的整体项目在C,所以这将只是一个实验的字符串拆分速度。)



感谢您的帮助。



最终编辑/解决方案 / p>

请参阅Alf接受的答案。由于python严格地通过引用来处理字符串,并且STL字符串经常被复制,所以使用香草python实现的性能更好。为了比较,我编译和运行我的数据通过Alf的代码,这里是在同一台机器上的性能与所有其他运行,本质上相同的naive python实现(虽然比重置/附加列表的python实现,比显示在上述编辑中):

  $ / usr / bin / time cat test_lines_double | ./split6 
15.09实际0.01用户0.45 sys
C ++:在15秒内看到20000000行。崩溃速度:1333333

我唯一小的遗憾是关于获得C ++执行所需的代码量在这种情况下。



从这个问题和昨天的stdin行读取问题(上面链接)的一个教训是,应该总是基准,而不是对语言的相对默认性能。我喜欢教育。



再次感谢您的建议!

解决方案

猜测,Python字符串是引用计数的不可变字符串,因此在Python代码中没有字符串被复制,而C ++ std :: string 是一个可变值类型,



如果目标是快速拆分,那么可以使用常量时间子串运算,这意味着只有指向



C ++ std :: string 类有一个兑换功能,但它是 标准,以便它可以用于安全和可移植的传递字符串,在那里效率不是主要考虑。但足够的聊天。代码 - 在我的机器上,这当然比Python快,因为Python的字符串处理是在C中实现的,它是C ++的一个子集(他他):

  #include< iostream> 
#include< string>
#include< sstream>
#include< time.h>
#include< vector>

using namespace std;

class StringRef
{
private:
char const * begin_;
int size_;

public:
int size()const {return size_; }
char const * begin()const {return begin_; }
char const * end()const {return begin_ + size_; }

StringRef(char const * const begin,int const size)
:begin_(begin)
,size_(size)
{}
} ;

vector< StringRef> split3(string const& str,char delimiter ='')
{
vector< StringRef>结果;

枚举状态{inSpace,inToken};

状态state = inSpace;
char const * pTokenBegin = 0; // Init以满足编译器。
for(auto it = str.begin(); it!= str.end(); ++ it)
{
State const newState =(* it == delimiter? inToken);
if(newState!= state)
{
switch(newState)
{
case inSpace:
result.push_back(StringRef(pTokenBegin,& * it - pTokenBegin));
break;
case inToken:
pTokenBegin =& * it;
}
}
state = newState;
}
if(state == inToken)
{
result.push_back(StringRef(pTokenBegin,& * str.end() - pTokenBegin));
}
return result;
}

int main(){
string input_line;
vector< string>样条;
long count = 0;
int sec,lps;
time_t start = time(NULL);

cin.sync_with_stdio(false); // disable synchronous IO

while(cin){
getline(cin,input_line);
//spline.clear(); //为下一行清空向量parse

//我正在尝试两个实现之一,每次编译,显然:
// split1(spline,input_line);
// split2(spline,input_line);

vector< StringRef> const v = split3(input_line);
count ++;
};

count--; //减去最终的重读
sec =(int)time(NULL) - start;
cerr<< C ++:Saw< count<< lines in<< sec << 秒。 ;
if(sec> 0){
lps = count / sec;
cerr<< 咬合速度:< lps< endl;
} else
cerr<< endl;
return 0;
}

//编译时使用:g ++ -Wall -O3 -o split1 split_1.cpp -std = c ++ 0x

免责声明:我希望没有任何错误。我没有测试的功能,但只检查了速度。但我认为,即使有一个或两个,纠正这不会显着影响速度。


I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).

Python Code:

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ Code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimeter character (that one was posted by several StackOverflow users in answers to string splitting questions).

I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"

Results:

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?

Edit 1 / Partial Solution?:

I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

The performance of python is now about the same as the split1 C++ implementation.

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)

Thanks to all for your help.

Final Edit/Solution:

Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.

One of the lessons here from this issue and yesterday's stdin line reading issue (linked above) are that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.

Thanks again to all for your suggestions!

解决方案

As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );
        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.

这篇关于为什么C ++中的字符串比Python慢​​?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆