升压字符串匹配DFA [英] boost string matching DFA

查看:92
本文介绍了升压字符串匹配DFA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出一个字符串,我必须测试它是否以一组已知的后缀结尾.现在,由于后缀的数目不是很小,因此必须对照该已知后缀列表检查文档中的每个单词.单词和后缀中的每个字符均为char32_t.作为幼稚的迭代匹配将是昂贵的.尽管大多数后缀都不是子后缀或不是另一个后缀的前缀,但是大多数后缀都是由一小部分字符构成的.大多数支票将是未命中而不是成功.

Given a string I have to test whether that ends with a known set of suffixes. Now as the the of suffixes are not very small and every word in the document has to be checked against that list of known suffixes. Every character in the word and suffix is char32_t. As a naive iterative matching will be expensive. Though most of the suffixes are not sub suffix or prefix of another suffix, most of them are constructed with a small set of characters. Most of the checks will be a miss rather than being a hit.

因此,我想构建后缀的DFA以最大程度地减少错过的费用.我可以手动解析unicode代码点并使用boost-graph创建DFA.但是,是否有任何现有的库可以为我构建该库?

So I want to build a DFA of the suffixes to minimize the cost of miss. I can manually parse the unicode codepoints and create a DFA using boost-graph. But is there any existing library that will build that for me ?

包含正则表达式的大型正则表达式是否会比DFA便宜,因为正则表达式搜索还会构建一个以类似方式进行匹配的DFA?但是我想知道匹配时哪个后缀匹配.在正则表达式的情况下,我将需要执行另一个线性搜索以获取该信息(我无法标记正则表达式的内部DFA的顶点).我还需要unicode正则表达式.我想,仅在所有后缀中加上|便与线性搜索一样昂贵.我想我需要检查常见字符并使用lookahed和backbacks相应地创建正则表达式.手动构建DFA是否需要面对同样的困难?

Will a huge regex containing all the suffixes be less expensive than the DFA, because regex search also builds a DFA for matching in the similar way ? But I want to know which suffix was matched when there is a hit. In regex case I'll need to perform another linear search to get that (I cannot label the vertices of the regex's internal DFA). Also I need unicode regex. Just putting all the suffixes with a | will be as expensive as a linear search I guess. I think I need to check the common characters and create the regex accordingly with lookahed and lookbacks. Isn't it the same difficulty that I need to face to construct the DFA manually ?

我正在使用utf-32进行随机访问.但是,如果我可以轻松地将其转换为utf-8,则不是问题.我将从右到左反转字符串和模式.

I am using utf-32 for random access. However it is not a problem to switch to utf-8 if I can solve it easily with that. I will reverse the string and the pattern right to left.

推荐答案

您是否考虑过Spirit?当然,您没有指定如何在上下文中检测后缀(您是否需要在结尾处进行后缀,是否需要在其后加上一些语法等),但是您可以执行以下操作:

Have you considered Spirit? Of course you didn't specify how you detect suffixes in context (do you require them at the end, do you require some grammar preceding it etc.) but you could do something like this:

    x3::symbols<Char> sym;
    sym += "foo", "bar", "qux";

它构建了一个Trie,非常有效.它可以解析任何类型的输入迭代器(如果您愿意的话,还可以包括流).只需为上下文要求添加一些魔术约束,例如输入结束:

It builds a Trie, which is pretty effective. It can parse any kind of input iterator (including streams if you are so inclined). Just add a bit of magic constrain for contextual requirements, like end-of-input:

bool has_suffix(string_view sv) {
    return parse(sv.cbegin(), sv.cend(), x3::seek[suffix >> x3::eoi]);
}

如果您甚至希望返回字符串的文本值,只需执行以下操作:

If you even wish to return the text value of the string, simply do this:

string_view get_suffix(string_view sv) {
    boost::iterator_range<string_view::const_iterator> output;
    parse(sv.cbegin(), sv.cend(), x3::seek[x3::raw[suffix >> x3::eoi]], output);
    return {output.begin(), output.size()};
}

精神让您有很多自由地包围智能设备,可以动态添加/删除符号,例如将no_case与Trie等一起使用.

Spirit leaves you a lot of freedom to surround with smarts, dynamically add/remove symbols, e.g. use no_case with the Trie etc.

使用X3(c ++ 14)

Using X3 (c++14)

在Coliru上直播

#include <boost/spirit/home/x3.hpp>
#include <string_view>
#include <cstdint>

namespace Demo {
    using Char = char32_t;
    using string_view = std::basic_string_view<Char>;

    namespace x3 = boost::spirit::x3;

    static auto const suffix = [] {
        x3::symbols<Char> sym;
        sym += "foo", "bar", "qux";

        return sym; // x3::no_case[sym];
    }();

    bool has_suffix(string_view sv) {
        return parse(sv.cbegin(), sv.cend(), x3::seek[suffix >> x3::eoi]);
    }

    string_view get_suffix(string_view sv) {
        boost::iterator_range<string_view::const_iterator> output;
        parse(sv.cbegin(), sv.cend(), x3::seek[x3::raw[suffix >> x3::eoi]], output);
        return {output.begin(), output.size()};
    }
}

#include <iostream>
#include <iomanip>

int main() {
    using namespace Demo;

    auto widen = [](string_view sv) { return std::wstring(sv.begin(), sv.end()); };
    std::wcout << std::boolalpha;

    for (string_view testcase : { U"nope", U"lolbar you betqux" }) {
        std::wcout 
            << widen(testcase) 
            << L" -> " << has_suffix(testcase)
            << L" (" << widen(get_suffix(testcase))
            << L")\n";
    }
}

打印

nope -> false ()
lolbar you betqux -> true (qux)

Spirit Qi版本

文字端口: 在Coliru上直播

仅C ++ 11版本: 在Coliru上直播

A C++11 only version: Live On Coliru

以及具有真正复古编程经验的C ++ 03版本: 在Coliru上直播

And a C++03 version for the really retro programming experience: Live On Coliru

这篇关于升压字符串匹配DFA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆