String matching performance: gcc versus CPython


Question



Whilst researching performance trade-offs between Python and C++, I've devised a small example, which mostly focusses on a dumb substring matching.

Here is the relevant C++:

using std::string;
std::vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
   [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

The above is built with -O3.

And here is Python:

def getMatchingPatterns(patterns, text):
    return filter(text.__contains__, patterns)

Both of them take a large-ish set of patterns and input file, and filter down the list of patterns to the ones found in the file using a dumb substring search.

The versions are:

  • gcc - 4.8.2 (Ubuntu) and 4.9.2 (cygwin)
  • python - 2.7.6 (Ubuntu) and 2.7.8 (cygwin)

What was surprising to me is the performance. I ran both on a low-spec Ubuntu machine, and Python was faster by about 20%; on a mid-spec PC with cygwin, Python was twice as fast. The profiler shows that 99+% of cycles are spent in string matching (string copying and list comprehensions are insignificant).

Obviously, the Python implementation is native C, and I'd expected it to be roughly on par with C++, but I did not expect it to be faster.

Any insight into relevant CPython optimisations in comparison to gcc would be most welcome.

For reference, here are the full examples. The input is a set of 50K HTML files (all read from disk in each test, no special caching):

Python:

import sys

def getMatchingPatterns(patterns, text):
   return filter(text.__contains__, patterns)

def serialScan(filenames, patterns):
   return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   resultTuple = serialScan(filenames, patterns)
   for filename, patterns in resultTuple:
      print ': '.join([filename, ','.join(patterns)])

C++:

#include <iostream>
#include <iterator>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>

using namespace std;
using MatchResult = unordered_map<string, vector<string>>;
static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;

MatchResult serialMatch(const vector<string> &filenames, const vector<string> &patterns)
   {
   MatchResult res;
   for (auto &filename : filenames)
      {
      ifstream file(filename);
      const string fileContents((istreambuf_iterator<char>(file)),
                                         istreambuf_iterator<char>());
      vector<string> matches;
      std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
                   [&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );

      res.insert(make_pair(filename, std::move(matches)));
      }
   return res;
   }

int main(int argc, char **argv)
    {
    vector<string> filenames;
    ifstream filenamesListFile(argv[1]);
    std::copy(istream_iterator<string>(filenamesListFile), istream_iterator<string>(),
             back_inserter(filenames));

    vector<string> patterns;
    patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
    ifstream patternsFile(argv[2]);
    std::copy(istream_iterator<string>(patternsFile), istream_iterator<string>(),
             back_inserter(patterns));

    auto matchResult = serialMatch(filenames, patterns);

    for (const auto &matchItem : matchResult)
      {
      cout << matchItem.first << ": ";
      for (const auto &matchString : matchItem.second)
         cout << matchString << ",";
      cout << endl;
      }
    }

Solution

The Python 3.4 code b'abc' in b'abcabc' (or b'abcabc'.__contains__(b'abc') as in your example) executes the bytes_contains method, which in turn calls the inlined function stringlib_find, which delegates the search to FASTSEARCH.

The FASTSEARCH function then uses a modified Boyer-Moore search algorithm:

fast search/count implementation, based on a mix between boyer-moore and horspool, with a few more bells and whistles on the top. for some more background, see: http://effbot.org/zone/stringlib.htm

There are some modifications too, as noted by the comments:

note: fastsearch may access s[n], which isn't a problem when using Python's ordinary string types, but may cause problems if you're using this code in other contexts. also, the count mode returns -1 if there cannot possibly be a match in the target string, and 0 if it has actually checked for matches, but didn't find any. callers beware!
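The skip-ahead idea behind FASTSEARCH can be illustrated with a textbook Boyer-Moore-Horspool search. Note this is a sketch of the algorithm family, not CPython's exact variant (FASTSEARCH replaces the full skip table with a compressed bloom-filter-style bitmask):

```cpp
#include <cstddef>
#include <string>

// Textbook Boyer-Moore-Horspool substring search (illustrative sketch, not
// CPython's FASTSEARCH).  Returns the offset of the first match, or npos.
std::size_t horspool_find(const std::string &text, const std::string &pattern)
{
    const std::size_t n = text.size(), m = pattern.size();
    if (m == 0) return 0;
    if (m > n)  return std::string::npos;

    // skip[c]: how far the pattern may shift when the text character aligned
    // with the pattern's last position is c.  Default: the full pattern length.
    std::size_t skip[256];
    for (std::size_t c = 0; c < 256; ++c)
        skip[c] = m;
    for (std::size_t i = 0; i + 1 < m; ++i)
        skip[static_cast<unsigned char>(pattern[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n)
    {
        // Compare right-to-left, as Boyer-Moore-style searches do.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pattern[j - 1])
            --j;
        if (j == 0)
            return pos;             // full match at offset pos
        pos += skip[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return std::string::npos;
}
```

On a mismatch, the shift can be as large as the whole pattern length, which is why such searches run in sublinear time on typical inputs, in contrast to the character-by-character scan.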


The GNU C++ Standard Library basic_string<T>::find() implementation is as generic (and dumb) as possible; it just tries dumbly matching the pattern at each and every consecutive character position until it finds the match.
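For contrast, the naive scan that such a fully generic find() boils down to can be sketched as follows (an illustration, not the actual libstdc++ source):

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch of a fully generic, naive substring scan: try every
// starting offset and compare character by character.  Worst case is
// O(text length * pattern length) comparisons, with no skipping.
std::size_t naive_find(const std::string &text, const std::string &pattern)
{
    if (pattern.size() > text.size())
        return std::string::npos;
    for (std::size_t i = 0; i + pattern.size() <= text.size(); ++i)
    {
        std::size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j])
            ++j;
        if (j == pattern.size())
            return i;               // full match at offset i
    }
    return std::string::npos;
}
```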


TL;DR: The reason why the C++ standard library is so slow compared to Python is because it tries to do a generic algorithm on top of std::basic_string<char>, but fails to do it efficiently for the more interesting cases; whereas in Python the programmer gets the most efficient algorithms on case-by-case basis for free.
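If C++17 is available (it was not with the gcc 4.8/4.9 used in the question), std::search combined with std::boyer_moore_searcher gives the C++ side a comparable algorithm. A sketch of the copy_if from the question rewritten that way:

```cpp
#include <algorithm>
#include <functional>   // std::boyer_moore_searcher (C++17)
#include <iterator>
#include <string>
#include <vector>

// Sketch: match patterns against fileContents using a Boyer-Moore searcher
// instead of basic_string::find.  Requires C++17.
std::vector<std::string> matchWithBoyerMoore(const std::vector<std::string> &patterns,
                                             const std::string &fileContents)
{
    std::vector<std::string> matches;
    std::copy_if(patterns.cbegin(), patterns.cend(), std::back_inserter(matches),
                 [&fileContents](const std::string &pattern) {
                     // The searcher precomputes its skip tables from the pattern.
                     std::boyer_moore_searcher searcher(pattern.cbegin(), pattern.cend());
                     return std::search(fileContents.cbegin(), fileContents.cend(),
                                        searcher) != fileContents.cend();
                 });
    return matches;
}
```

Since the searcher's tables depend only on the pattern, building each searcher once outside the per-file loop (rather than per call, as in this sketch) would amortise the setup cost across the 50K files.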
