Why is C++11 regex (libc++ implementation) so slow?


Question

I compared the Linux C regex library (regex.h) with libc++'s std::regex. First, the C version:

#include <iostream>
#include <chrono>
#include <regex.h>

int main()
{
    const int count = 100000;

    regex_t exp;
    int rv = regcomp(&exp, R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", REG_EXTENDED);
    if (rv != 0) {
            // Do not run the benchmark with an uncompiled pattern.
            std::cout << "regcomp failed with " << rv << std::endl;
            return 1;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; i++)
    {
            regmatch_t match;
            const char *sz = "http://www.abc.com";

            if (regexec(&exp, sz, 1, &match, 0) == 0) {
    //              std::cout << sz << " matches characters " << match.rm_so << " - " << match.rm_eo << std::endl;
            } else {
    //              std::cout << sz << " does not match" << std::endl;
            }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << elapsed.count() << std::endl;

    return 0;
}

The result is roughly 60-70 milliseconds on my test machine.

Then I used libc++'s regex library:

#include <iostream>
#include <chrono>
#include <regex>


int main()
{
        const int count = 100000;

        std::regex rgx(R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", std::regex_constants::extended);
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                std::cmatch match;
                const char sz[] = "http://www.abc.com";

                if (regex_search(sz, match, rgx)) {
                } else {
                }
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_search: " << elapsed.count() << std::endl;


        start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                const char sz[] = "http://www.abc.com";

                if (regex_match(sz, rgx)) {
                } else {
                }
        }
        end = std::chrono::high_resolution_clock::now();
        elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_match: " << elapsed.count() << std::endl;

        return 0;
}

The result is roughly 2 seconds for both regex_search and regex_match. That is about 30 times slower than C's regex.h library.

Is there anything wrong with my comparison? Is C++'s regex library unsuitable for high-performance use?

I can understand it being slow because C++'s regex library has not been optimized yet, but 30 times slower is just too much.

Thanks.

Hi all,

Thanks for the answers.

Sorry, my mistake: I was using [] (an array) for sz in the C version too, but later changed it and forgot to change the C++ code.

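I made two changes:

  1. I moved const char sz[] outside the loop for both C and C++.

  2. I compiled with -O2 (I wasn't using any optimization before). The C library's implementation still takes about 60 milliseconds, but libc++'s regex now takes about 1 second for regex_search and about 150 milliseconds for regex_match.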

This is still a bit slow, but the gap is much smaller than in the original comparison.
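The two changes above can be sketched as a small timing helper. This is only an illustration, not the poster's revised program: `BenchResult` and `time_regex_search` are hypothetical names, and the match counter is an addition meant to keep an optimizing build (-O2) from discarding the loop body.

```cpp
#include <chrono>
#include <regex>

// Hypothetical result type (not from the original post).
struct BenchResult {
    int matches;       // how many iterations reported a match
    long long micros;  // elapsed wall-clock time in microseconds
};

// Hypothetical helper: times `count` calls to std::regex_search on a subject
// string that is created once by the caller, outside the timed loop.
inline BenchResult time_regex_search(const std::regex& rgx,
                                     const char* subject,
                                     int count)
{
    int matches = 0;
    std::cmatch match;  // reused across iterations

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; i++) {
        if (std::regex_search(subject, match, rgx)) {
            ++matches;  // observable side effect, so the call is not dropped
        }
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    return BenchResult{matches, static_cast<long long>(elapsed.count())};
}
```

A caller would build the std::regex once, pass "http://www.abc.com" as the subject, and print the returned `micros`. Build with optimization enabled (for example -O2) before comparing the figure against regexec.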

Recommended answer

If you take a look at http://llvm.org/svn/llvm-project/libcxx/trunk/include/regex you'll see that this implementation of regex_match is layered atop regex_search, and that all overloads extract sub-expression match positions, even if only into local temporaries that are thrown away. regex_search uses a vector of __state objects that have .resize() called on them, so they are presumably vectors too. All of that means heap allocations, which are unnecessary when the sub-expression matches aren't wanted, but which would need to be tracked to support \1 etc. in Perl-style extensions to regular expressions. The old regcomp/regexec C functions, which didn't provide those extended features, never have to do this extra work. Of course it would be nice if the libc++ implementation checked during compilation whether the regular expression needs match tracking and called leaner, faster matching functions when possible, but I guess they're just starting with support for the general case.
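As a rough sketch of how a caller can tell the library that sub-expression matches are not wanted, the standard defines the std::regex_constants::nosubs and std::regex_constants::optimize syntax flags, and regex_search has an overload that takes no match_results argument. Whether libc++ actually runs faster with these is an assumption that should be measured, not something the answer above claims. The helper names `make_url_regex` and `contains_url` are hypothetical.

```cpp
#include <regex>

// Hypothetical helper (not from the original post): compiles the URL pattern
// from the question with flags that ask the library not to record
// sub-expression matches (nosubs) and to favor matching speed over
// construction speed (optimize). Any speedup is implementation-dependent.
inline std::regex make_url_regex()
{
    namespace rc = std::regex_constants;
    return std::regex(R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_",
                      rc::extended | rc::nosubs | rc::optimize);
}

// Hypothetical helper: returns true if `text` contains a URL-like substring.
// Uses the regex_search overload without a match_results parameter, since
// only the yes/no answer is needed.
inline bool contains_url(const std::regex& rgx, const char* text)
{
    return std::regex_search(text, rgx);
}
```

The regex should be built once, outside any hot loop, and reused: for example `const std::regex rgx = make_url_regex();` followed by repeated calls to `contains_url(rgx, sz)`.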
