Spirit X3: parser with internal state


Problem description

I want to efficiently parse large CSV-like files whose column order I only learn at runtime. With Spirit Qi, I would parse each field with a lazy auxiliary parser that selects at runtime which column-specific parser to apply to each column. But X3 doesn't seem to have lazy (even though it's listed in the documentation). After reading recommendations here on SO, I've decided to write a custom parser.

It ended up being pretty nice, but now I've noticed I don't really need the pos variable to be exposed anywhere outside the custom parser itself. I've tried putting it into the custom parser and started getting compiler errors stating that the column_value_parser object is read-only. Can I somehow put pos into the parser structure?

Simplified code that gets the compile-time error, with commented out parts of my working version:

#include <iostream>
#include <variant>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    // size_t& pos;
    size_t pos;

    // column_value_parser(std::vector<column_variant>& columns, size_t& pos)
    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    //    , pos(pos)
        , pos(0)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& ctx, Other const& other, Attr& attr) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text& c) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer& c) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real& c) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};


int main(int argc, char *argv[])
{
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    // Comes from external source.
    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    size_t pos = 0;

    boost::spirit::x3::parse(
        input.begin(), input.end(),
//         (column_value_parser(columns, pos) % ',') % boost::spirit::x3::eol);
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}

XY: My goal is to parse ~500 GB of pseudo-CSV files in a reasonable time on a machine with little RAM, convert them into a list of (roughly) [row-number, column-name, value], then put that into storage. The format is actually a little more complex than CSV: database dumps formatted in a… human-friendly way, with column values actually being several small sublanguages (e.g. dates or, uh, something similar to whole apache log lines stuffed into a single field), and I'm often extracting only one specific part of each column. Different files may have different columns in a different order, which I can only learn by parsing yet another set of files containing the original queries. Thankfully, Spirit makes it a breeze…

Solution

Three answers:

  1. The easiest fix is to make pos a mutable member
  2. The X3 hardcore answer is x3::with<>
  3. Functional composition

1. Making pos mutable

Live On Wandbox

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    size_t mutable pos = 0; // mutable: parse() is const but must advance the column index

    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& /*ctx*/, Other const& /*other*/, Attr& /*attr*/) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text&) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer&) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real&) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};


int main() {
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}
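
The reason mutable is enough here is plain C++: parse() is a const member function, so an ordinary data member is read-only inside it, while a mutable member is not. The sketch below isolates that point without Spirit. The names column_cursor and visit_order are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a stateful parser; not part of Spirit X3.
struct column_cursor {
    std::size_t column_count;
    // `mutable` allows a const member function to update this field.
    mutable std::size_t pos = 0;

    explicit column_cursor(std::size_t n) : column_count(n) {}

    // Const, like the parse() member above: returns the column index for
    // the current field, then advances and wraps around.
    std::size_t next() const {
        std::size_t const current = pos;
        pos = (pos + 1) % column_count;
        return current;
    }
};

// Drives the cursor through a const reference, the way a caller that only
// holds the parser by const would.
inline std::vector<std::size_t> visit_order(column_cursor const& cursor,
                                            std::size_t fields) {
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < fields; ++i) order.push_back(cursor.next());
    return order;
}
```

Calling visit_order(column_cursor(4), 6) walks the indices 0, 1, 2, 3, 0, 1. Keep in mind that copying such an object copies pos as well, so every copy advances independently.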

2. x3::with<>

This is similar but with better (re)entrancy and encapsulation:

Live On Wandbox

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;

    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx const& ctx, Other const& /*other*/, Attr& /*attr*/) const {
        auto const saved_f = f;
        bool successful = false;

        size_t& pos = boost::spirit::x3::get<pos_tag>(ctx).value;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text&) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer&) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real&) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }

    template <typename T>
    struct Mutable { T mutable value; };
    struct pos_tag;

    auto invoke() const {
        return boost::spirit::x3::with<pos_tag>(Mutable<size_t>{}) [ *this ];
    }
};


int main() {
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    column_value_parser p(columns);

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        (p.invoke() % ',') % boost::spirit::x3::eol);
}

3. Functional Composition

Because it's so much easier in X3, my favourite is to just generate the parser on demand.

Without further requirements, this is the simplest I'd propose:

Live On Wandbox

#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;

namespace CSV {
    struct text    { };
    struct integer { };
    struct real    { };
    struct skip    { };

    auto const unquoted_text_field = *~x3::char_(",\n");
    static inline auto as_parser(skip)    { return x3::omit[unquoted_text_field]; }
    static inline auto as_parser(text)    { return unquoted_text_field;           }
    static inline auto as_parser(integer) { return x3::int_;                      }
    static inline auto as_parser(real)    { return x3::double_;                   }

    template <typename... Spec>
    static inline auto line_parser(Spec... spec) {
        auto delim = ',' | &(x3::eoi | x3::eol);
        return ((as_parser(spec) >> delim) >> ... >> x3::eps);
    }

    template <typename... Spec> static inline auto csv_parser(Spec... spec) {
        return line_parser(spec...) % x3::eol;
    }
}

#include <iostream>
#include <iomanip>
using namespace CSV;

int main() {
    std::string const input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";
    auto f = begin(input), l = end(input);

    auto p = csv_parser(text{}, integer{}, real{}, skip{});

    if (parse(f, l, p)) {
        std::cout << "Parsed\n";
    } else {
        std::cout << "Failed\n";
    }

    if (f!=l) {
        std::cout << "Remaining: " << std::quoted(std::string(f,l)) << "\n";
    }
}
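
The line_parser above relies on a C++17 fold expression to chain one parser per column. As a rough illustration of that mechanism only, here is a Spirit-free sketch that folds hand-written field functions with && over a std::string_view. Everything in namespace sketch (take_field, parse_one, delim, parse_line) is hypothetical and far simpler than what X3 generates.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

namespace sketch {
    struct text    { };
    struct integer { };
    struct skip    { };

    // Consume characters up to (not including) the next ',' or '\n'.
    inline std::string_view take_field(std::string_view& in) {
        std::size_t const n = in.find_first_of(",\n");
        std::string_view const field = in.substr(0, n);
        in.remove_prefix(field.size());
        return field;
    }

    // One overload per column tag, selected at compile time.
    inline bool parse_one(text, std::string_view& in) { take_field(in); return true; }
    inline bool parse_one(skip, std::string_view& in) { take_field(in); return true; }
    inline bool parse_one(integer, std::string_view& in) {
        std::string_view const field = take_field(in);
        int value = 0;
        auto const r = std::from_chars(field.data(), field.data() + field.size(), value);
        // Succeed only if the whole field is a valid integer.
        return r.ec == std::errc{} && r.ptr == field.data() + field.size();
    }

    // Accept a ',' or stop in front of end-of-line / end-of-input.
    inline bool delim(std::string_view& in) {
        if (in.empty() || in.front() == '\n') return true;
        if (in.front() == ',') { in.remove_prefix(1); return true; }
        return false;
    }

    // One line: every column in order; && short-circuits on the first failure.
    template <typename... Spec>
    bool parse_line(std::string_view& in, Spec... spec) {
        return ((parse_one(spec, in) && delim(in)) && ...);
    }
}
```

For example, sketch::parse_line(in, sketch::text{}, sketch::integer{}, sketch::skip{}) on "Hello,1,XXX\nWorld,2,YYY" succeeds and leaves in pointing at "\nWorld,2,YYY". As in the X3 version, the column layout is fixed at compile time by the argument types.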

A version with debug information enabled:

Live On Wandbox

<line>
  <try>Hello,1,13.7,XXX\nWor</try>
  <CSV::text>
    <try>Hello,1,13.7,XXX\nWor</try>
    <success>,1,13.7,XXX\nWorld,2,</success>
  </CSV::text>
  <CSV::integer>
    <try>1,13.7,XXX\nWorld,2,1</try>
    <success>,13.7,XXX\nWorld,2,1e</success>
  </CSV::integer>
  <CSV::real>
    <try>13.7,XXX\nWorld,2,1e3</try>
    <success>,XXX\nWorld,2,1e3,YYY</success>
  </CSV::real>
  <CSV::skip>
    <try>XXX\nWorld,2,1e3,YYY</try>
    <success>\nWorld,2,1e3,YYY</success>
  </CSV::skip>
  <success>\nWorld,2,1e3,YYY</success>
</line>
<line>
  <try>World,2,1e3,YYY</try>
  <CSV::text>
    <try>World,2,1e3,YYY</try>
    <success>,2,1e3,YYY</success>
  </CSV::text>
  <CSV::integer>
    <try>2,1e3,YYY</try>
    <success>,1e3,YYY</success>
  </CSV::integer>
  <CSV::real>
    <try>1e3,YYY</try>
    <success>,YYY</success>
  </CSV::real>
  <CSV::skip>
    <try>YYY</try>
    <success></success>
  </CSV::skip>
  <success></success>
</line>
Parsed

Notes, Caveats:

  • With anything mutable, beware of side-effects. E.g. if you have a | b and a includes column_value_parser, the side-effect of incrementing pos will not be rolled back when a fails and b is matched instead.

    In short, this makes your parse function impure.
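
To make the caveat concrete, here is a small Spirit-free sketch of an alternative a | b in which both branches use the same stateful field parser. It assumes a is "field followed by '!'" and b is "field alone"; counting_field, alternative_naive and alternative_restoring are hypothetical names for this illustration, not X3 facilities.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical miniature of a parser with mutable state.
struct counting_field {
    mutable std::size_t pos = 0;   // side-effect state

    // Consume one field up to the next ','; always succeeds and bumps pos.
    bool parse(std::string_view& in) const {
        in.remove_prefix(std::min(in.find(','), in.size()));
        ++pos;
        return true;
    }
};

// Models `a | b`: only the input position is rolled back when `a` fails.
inline bool alternative_naive(counting_field const& p, std::string_view& in) {
    std::string_view const saved = in;
    if (p.parse(in) && !in.empty() && in.front() == '!') {
        in.remove_prefix(1);
        return true;
    }
    in = saved;            // input rolled back...
    return p.parse(in);    // ...but pos already counted the failed branch
}

// Same alternative, but the side effect is undone by hand before trying `b`.
inline bool alternative_restoring(counting_field const& p, std::string_view& in) {
    std::string_view const saved_in = in;
    std::size_t const saved_pos = p.pos;
    if (p.parse(in) && !in.empty() && in.front() == '!') {
        in.remove_prefix(1);
        return true;
    }
    in = saved_in;
    p.pos = saved_pos;     // restore the state the failed branch changed
    return p.parse(in);
}
```

On the input "abc,def", alternative_naive leaves pos at 2 although only one field was consumed, while alternative_restoring leaves it at 1. Manual restoration like this is one possible workaround; keeping the parser free of mutable state avoids the problem altogether.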
