如何为给定的 boost::spirit 语法提供自动完成建议的用户? [英] How to provider user with autocomplete suggestions for given boost::spirit grammar?

查看:36
本文介绍了如何为给定的 boost::spirit 语法提供自动完成建议的用户?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 Boost::Spirit 在我的 C++ GUI 应用程序中为非技术用户构建简单的数据过滤器"语言.语言与普通英语非常相似,可以解析为 AST.我被要求尽可能使过程对用户友好,因此我希望提供类似 CLang 的错误消息(无法识别的 'tokken',您的意思是 'tokken'?")和自动完成.

现在的问题是如何使用 Boost::Spirit 语法为两个目标生成可能的标记列表(我将使用简单的字符串距离算法来满足第一个要求)?

我目前的想法:

  • 将源流中的位置和长度添加到从解析器获得的令牌定义中.
  • 向语法添加特殊的无效标记,它将匹配任何内容......好吧......无效:-)
  • 如果用户按 ctrl + 空格键,构建 AST(使用无效令牌,树将始终可构建),然后在当前光标位置内查找令牌
  • 在所有可能的标记上使用简单的匹配器(毕竟我有标记列表)并准备一个建议列表

此解决方案的问题在于,该建议还会建议对给定位置无效的令牌.如果我添加(并且我会)可定义的标识符,我手头的问题就会大得多...

还有一个限制:我想只在一个地方定义这种语言的语法;如果语法发生变化,我希望自动完成程序在重新编译后意识到这一变化

解决方案

出于好奇,我决定尝试这个概念.

这是我的尝试.

计划

让我们用函数调用来解析算术表达式.

现在,我们想用可能的未知标识符解析(部分)表达式.

在不完整的表达式的情况下,我们希望暗示"完成它的最小标记(并可能继续解析).

对于未知标识符,我们希望对域中的已知标识符(变量或函数)进行模糊匹配,并按概率递减的顺序对它们进行排序.

<小时>

基本定义

让我们首先决定我们希望我们的输入在内存中,所以我们可以使用 string_views 来引用位置/子字符串:

#include 使用 Source = boost::string_view;使用 Location = Source::const_iterator;

完成提示

除了 AST,我们希望我们的解析器生成完成提示(隐式完成标记和建议)

命名空间完成{使用 Candidates = std::vector;类提示{struct ByLocation {模板 <typename T, typename U>bool operator()(T const& a, U const& b) const { return loc(a) <地方(乙);}私人的:静态位置 loc(Source const& s) { return s.begin();}静态位置 loc(Location const& l) { return l;}};民众:std::map不完整;std::map<源、候选、按位置>建议;/*显式*/operator bool() const { return不完整的.size() ||建议大小();}};

此外,让我们编写一个快速的 &脏模糊标识符匹配评分函数.

我选择了一个简单的同步比较

  • 逐步对相应的字符进行评分,并且
  • 倾向于从候选字符中跳过字符而不是从输入中跳过字符(这意味着 adj_diffadjacent_difference 的一个很好的模糊匹配,即使字符已经从候选字符中被跳过,但是adj_qqq_diff 更糟,因为输入中的 qqq 无法匹配)
  • 算法是递归完成的,没有分配
  • 如果匹配,第一个字符会得到提升(rate=1 在第一次调用时)

 static int Fuzzy_match(Source input, boost::string_viewCandidate, int rate = 1) {//从第一个字母 boost 开始整数分数 = 0;while (!(input.empty() ||Candidate.empty())) {如果(输入.前()!=候选人.前()){返回分数 + std::max(Fuzzy_match(input.substr(1),Candidate, std::max(rate-2,0)),//忽略输入字符的惩罚模糊匹配(输入,候选.substr(1),std::max(rate-1,0)));}input.remove_prefix(1);候选人.remove_prefix(1);得分 += ++率;}返回分数;}}//命名空间完成

我们将看到如何在语法中使用它.

AST

可以执行二进制表达式、字符串/数字文字、变量和(嵌套)函数调用的普通 AST:

#include 命名空间 Ast {使用 NumLiteral = double;使用 StringLiteral = std::string;//去转义,不是源视图结构标识符:std::string {使用 std::string::string;使用 std::string::operator=;};结构体二进制表达式;结构调用表达式;使用表达式 = boost::variant<数字文字,字符串字面量,标识符,boost::recursive_wrapper,boost::recursive_wrapper>;结构体二进制表达式{表达式 lhs;字符操作;表达式 rhs;};使用 ArgList = std::vector<表达式>;结构调用表达式{标识符功能;ArgList 参数;};}

语法

<小时>

基础知识

语法一开始也很基本":

命名空间解析{命名空间 qi = boost::spirit::qi;命名空间 phx = boost::phoenix;模板 结构体表达式:qi::grammar{表达式(完成::提示和提示):表达式::基类型(开始),_提示(提示){使用命名空间qi;开始 = 跳过(空格)[表达式];表达式 = 术语 [_val = _1] >>*(char_("-+") >> 表达式) [_val = make_binary(_val, _1, _2)];术语 = 因子 [_val = _1] >>*(char_("*/") >> term) [_val = make_binary(_val, _1, _2)];因子 = 简单 [_val = _1] >>*(char_("^") >> 因子) [_val = make_binary(_val, _1, _2)];简单 = 调用 |变量 |复合|数量 |细绳;

到目前为止一切顺利:构造函数存储对要记录的 Completion::Hints& 的引用.所有这些规则都被声明为 qi::rule.

隐含令牌

现在有点有趣了,这里的一些规则是词素¹

 number = double_;identifier = raw[(alpha|'_') >>*(alnum|'_')];

有些作品可以容忍缺少结束标记:

//表示结束引号字符串 %= '"' >> *('\' >> char_ | ~char_('"')) >>默示('"');化合物 %= '(' >> 表达式 >> 隐含(')');

implied 魔法使得如果预期的结束标记丢失,它将被记录为提示,并继续解析:

 自动隐含 = [=](char ch) {return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);};

当然,这引出了imply(_1, ch) 真正做了什么的问题,我们稍后会看到.

<块引用>

现在,观察我们执行 raw[eps] 以获取(空)源 iterator_range 以从语义动作.

标识符查找

我们在Domain成员_variables_functions中存储符号表":

 using Domain = qi::symbols;域_变量,_函数;

然后我们声明一些可以对其中任何一个进行查找的规则:

//域标识符查找qi::_r1_type _domain;qi::rule<It, Ast::Identifier(Domain const&)>也许_已知,已知,未知;

很快就会显示相应的声明.

变量非常简单:

 variable = may_known(phx::ref(_variables));

通话比较棘手.如果一个名字是未知的,我们不想假设它暗示了一个函数,除非它后面跟着一个'('字符.然而,如果一个标识符是一个已知的函数名,我们甚至想暗示 ( (这使 UX 具有自动完成的外观,当用户键入 sqrt 时,它建议下一个字符为 ( 神奇地).

//启发式://- 一个未知标识符后跟 (//- 一个未关闭的参数列表意味着 )call %= ( known(phx::ref(_functions))//known -> 暗示括号|&(标识符>>'(')>>未知(phx::ref(_functions))) >>隐含('(')>> -(表达式%',')>>隐含(')');

这一切都建立在 knownunknownmaybe_known 之上:

///////////////////////////////////标识符loopkup,提示{也许_known = known(_domain) |未知(_域);//不同以避免部分匹配的标识符使用 boost::spirit::repository::qi::distinct;auto kw = distinct(copy(alnum | '_'));已知 = raw[kw[lazy(_domain)]];未知 = 原始 [标识符 [_val=_1]] [suggest_for(_1, _domain)];}

调试、预定义变量/函数

剩下的有点繁文缛节:

 BOOST_SPIRIT_DEBUG_NODES((开始)(表达式)(项)(因子)(简单)(复合)(调用)(变量)(标识符)(数字)(字符串)//(maybe_known)(known)(unknown)//qi::symbols<>不可流式传输)_variables += "foo", "bar", "qux";_functions += "print", "sin", "tan", "sqrt", "frobnicate";}私人的:完成::提示&_提示;使用 Domain = qi::symbols;域_变量,_函数;qi::rule开始;qi::rule表达式,术语,因子,简单;//可完成的qi::rule化合物;qi::rule称呼;//隐式词素qi::rule变量,标识符;qi::rule数字;qi::rule<It, Ast::StringLiteral()>细绳;//域标识符查找qi::_r1_type _domain;qi::rule<It, Ast::Identifier(Domain const&)>也许_已知,已知,未知;

凤凰帮

让我们从构建二进制 AST 节点的常用助手开始:

///////////////////////////////////二进制表达式工厂结构 make_binary_f {Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {返回 {lhs, op, rhs};}};boost::phoenix::functionmake_binary;

同样,我们可以有一个 Phoenix 函数对象来注册隐含字符:

///////////////////////////////////自动补全不完整的表达式结构暗示_f {完成::提示&_提示;void operator()(boost::iterator_range where, char hidden_​​char) const {自动插入 =_hints.incomplete.emplace(&*where.begin(), std::string(1, hidden_​​char));//将隐含字符添加到现有完成如果(!插入.第二)插入的.第一个->第二个+=隐含的字符;}};boost::phoenix::function暗示 { 暗示_f { _hints } };

请注意,它按位置排序(这使用户体验更容易)并检测先前隐含令牌何时位于同一位置,在这种情况下,我们只需附加到它.

到现在为止,为未知标识符生成相关建议也不足为奇了:

///////////////////////////////////建议为结构建议{完成::提示&_提示;void operator()(boost::iterator_range where, Domain const& symbols) const {使用命名空间完成;源 id(&*where.begin(), where.size());候选人 c;Symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });自动评分 = [id](Source v) { return Fuzzy_match(id, v);};auto byscore = [=](Source a, Source b) { return score(a) >得分(b);};sort(c.begin(), c.end(), byscore);c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) <3; }), c.end());_hints.suggestions.emplace(id, c);}};boost::phoenix::function;建议_为{建议{_提示}};

仅此而已.如果它看起来更复杂,那是因为它做了更多的事情:它对所有候选人进行模糊评分,按分数降序对他们进行排序,并删除任何分数小于等于的候选人.3.

 };}

奖金

让我们稍微改变一下,让 ',' 隐含在参数列表中:

 call %= ( known(phx::ref(_functions))//known -> 暗示括号|&(标识符>>'(')>>未知(phx::ref(_functions)))>>默示('(')>>-(表达 % ',')>>默示(')');

让我们在那里替换 ',' :

 >>-(表达式 % (',' | !(')'|eoi) >> 隐含(',')))

<块引用>

注意:检测 &expression 来断言后面跟着一个表达式,而不是断言输入/参数列表的末尾有未到达.

但是,这样做会很糟糕,因为任何包含的表达式都可能触发更多隐含标记作为副作用,导致重复的隐含标记.

这个(副作用语义动作)是我通常避免语义动作的主要原因之一,或者使用它们仅在规则的(公开的)属性中存储效果.参见 Boost Spirit:语义动作是邪恶的"?

测试驱动

如果没有一些不错的测试用例,这篇文章将一事无成.所以这里是:

int main() {对于(源常量输入:{,      //无效的"(3",//不完整,暗示 ')'"3*(6+sqrt(9))^7 - 1e8",//完全有效"(3*(((6+sqrt(9))^7 - 1e8",//不完整,暗示 ")))""print("hello \"world!",//完成字符串字面量和参数列表"foo",//好的,已知变量"baz",//(建议栏)"baz(",//不完整,隐含 ')',未知函数"taz(",//不完整,隐含 ')',未知函数"san(",//2 个建议 (sin/tan)"打印(1, 2, "三", 复杂的(san(78","(print sqrt sin 9) -0) "bye",}){std::cout <<-------------- '" <<输入<<"'
";位置 f = input.begin(), l = input.end();Ast::Expression expr;完成::提示提示;bool ok = parse(f, l, Parsing::Expression{hints}, expr);如果(提示){std::cout <<输入:'"<<输入<<"'
";}for (auto& c : hints.incomplete) {std::cout <<失踪" <<std::setw(c.first - input.begin()) <<"<<"^ 暗示:'" <<c.第二个<<"'
";}for (auto& id :hints.suggestions) {std::cout <<未知"<<std::setw(id.first.begin() - input.begin()) <<"<

请注意,除了第一个输入 ("") 之外,所有内容都被启发式解析为 某种 类型的表达式!最后一个旨在展示所有参数列表是如何隐含的,因为 printsqrtsin 是已知函数.然后隐含一些 ',' ,最后关闭未闭合的字符串文字和剩余的括号.这是(非调试)输出:

-------------- ''解析失败-------------- '(3'输入:'(3'缺少 ^ 暗示:')'AST:3-------------- '3*(6+sqrt(9))^7 - 1e8'AST: ((3 * ((6 + (sqrt (9))) ^ 7)) - 1e+08)-------------- '(3*(((6+sqrt(9))^7 - 1e8'输入:'(3*(((6+sqrt(9))^7 - 1e8'缺少 ^ 暗示:')))'AST: (3 * (((6 + (sqrt (9))) ^ 7) - 1e+08))-------------- 'print("hello "world!'输入:'print("hello "world!'缺少 ^ 暗示:'")'AST:(打印(你好世界!))-------------- '富'AST: foo-------------- '巴兹'输入:'baz'未知 ^^^(您的意思是酒吧"吗?)AST:巴兹-------------- 'baz('输入:'baz('缺少 ^ 暗示:')'未知 ^^^(没有建议)AST:(巴兹())-------------- 'taz('输入:'taz('缺少 ^ 暗示:')'未知 ^^^(您是说棕褐色"吗?)AST: (taz ())-------------- 'san('输入:'san('缺少 ^ 暗示:')'未知 ^^^(你的意思是罪"还是棕褐色"?)AST:(圣())-------------- 'print(1, 2, "three", complex(san(78')输入:'print(1, 2, "three", complex(san(78')缺少 ^ 暗示:')))'未知 ^^^^^^^^^^^(你是说frobnicate"还是print"?)未知 ^^^(你的意思是罪"还是棕褐色"?)AST: (打印 (1, 2, 三, (复杂 ((san (78))))))-------------- '(print sqrt sin 9) -0) 再见"输入:'(print sqrt sin 9) -0) 再见"缺少 ^ 暗示:'('缺少 ^ 暗示:'('缺少 ^ 暗示:'('缺少 ^ 暗示:','缺少 ^ 暗示:'"))'AST: (打印 ((sqrt (((sin (9)) - 0))), 再见))

完整列表/现场演示

生活在 Coliru

//#define BOOST_SPIRIT_DEBUG#include 使用 Source = boost::string_view;使用 Location = Source::const_iterator;#include <地图>#include <向量>命名空间完成{static int Fuzzy_match(Source input, boost::string_viewCandidate, int rate = 1) {//从第一个字母 boost 开始整数分数 = 0;while (!(input.empty() ||Candidate.empty())) {如果(输入.前()!=候选人.前()){返回分数 + std::max(Fuzzy_match(input.substr(1),Candidate, std::max(rate-2,0)),//忽略输入字符的惩罚模糊匹配(输入,候选.substr(1),std::max(rate-1,0)));}input.remove_prefix(1);候选人.remove_prefix(1);得分 += ++率;}返回分数;}使用 Candidates = std::vector;类提示{struct ByLocation {模板 <typename T, typename U>bool operator()(T const& a, U const& b) const { return loc(a) <地方(乙);}私人的:静态位置 loc(Source const& s) { return s.begin();}静态位置 loc(Location const& l) { return l;}};民众:std::map不完整;std::map<源、候选、按位置>建议;/*显式*/operator bool() const { return不完整的.size() ||建议大小();}};}#include 命名空间 Ast {使用 NumLiteral = double;使用 StringLiteral = std::string;//去转义,不是源视图结构标识符:std::string {使用 std::string::string;使用 std::string::operator=;};结构体二进制表达式;结构调用表达式;使用表达式 = boost::variant<数字文字,字符串字面量,标识符,boost::recursive_wrapper,boost::recursive_wrapper>;结构体二进制表达式{表达式 lhs;字符操作;表达式 rhs;};使用 ArgList = std::vector<表达式>;结构调用表达式{标识符功能;ArgList 参数;};}#include BOOST_FUSION_ADAPT_STRUCT(Ast::BinaryExpression, lhs, op, rhs)BOOST_FUSION_ADAPT_STRUCT(Ast::CallExpression,函数,参数)#include #include #include //用于调试打印:命名空间{struct once_t {//自动重置标志运算符 bool() { bool v = 标志;标志 = 假;返回 v;}布尔标志 = 真;};}//用于调试打印:命名空间 Ast {静态内联 std::ostream&运算符<<(std::ostream&os, std::vector<Expression>const&args){操作系统<<"(";首先一次_t;for (auto& a : args) os <<(第一个?"":", ") <<一种;返回操作系统<<")";}静态内联 std::ostream&运算符<<(std::ostream& os, BinaryExpression const& e) {返回操作系统<结构体表达式:qi::grammar{表达式(完成::提示和提示):表达式::基类型(开始),_提示(提示){使用命名空间qi;开始 = 跳过(空格)[表达式];表达式 = 术语 [_val = _1] >>*(char_("-+") >> 表达式) [_val = make_binary(_val, _1, _2)];术语 = 因子 [_val = _1] >>*(char_("*/") >> term) [_val = make_binary(_val, _1, _2)];因子 = 简单 [_val = _1] >>*(char_("^") >> 因子) [_val = make_binary(_val, _1, _2)];简单 = 调用 |变量 |复合|数量 |细绳;自动隐含 = [=](char ch) {return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);};变量=maybe_known(phx::ref(_variables));化合物 %= '(' >> 表达式 >> 隐含(')');//启发式://- 一个未知标识符后跟 (//- 一个未关闭的参数列表意味着 )call %= ( known(phx::ref(_functions))//known -> 暗示括号|&(标识符>>'(')>>未知(phx::ref(_functions)))>>默示('(')>>-(表达式 % (',' | !(')'|eoi) >> 隐含(',')))>>默示(')');//词素,原始规则identifier = raw[(alpha|'_') >>*(alnum|'_')];//暗示结束引号字符串 %= '"' >> *('\' >> char_ | ~char_('"')) >>hidden('"');//TODO 更多转义数字 = double_;//TODO 整数参数///////////////////////////////////标识符loopkup,提示{也许_known = known(_domain) |未知(_域);//不同以避免部分匹配的标识符使用 boost::spirit::repository::qi::distinct;auto kw = distinct(copy(alnum | '_'));已知 = raw[kw[lazy(_domain)]];未知 = 原始 [标识符 [_val=_1]] [suggest_for(_1, _domain)];}BOOST_SPIRIT_DEBUG_NODES((开始)(表达式)(项)(因子)(简单)(复合)(调用)(变量)(标识符)(数字)(字符串)//(maybe_known)(known)(unknown)//qi::symbols<>不可流式传输)_variables += "foo", "bar", "qux";_functions += "print", "sin", "tan", "sqrt", "frobnicate";}私人的:完成::提示&_提示;使用 Domain = qi::symbols;域_变量,_函数;qi::rule开始;qi::rule表达式,术语,因子,简单;//可完成的qi::rule化合物;qi::rule称呼;//隐式词素qi::rule变量,标识符;qi::rule数字;qi::rule<It, Ast::StringLiteral()>细绳;//域标识符查找qi::_r1_type _domain;qi::rule<It, Ast::Identifier(Domain const&)>也许_已知,已知,未知;///////////////////////////////////二进制表达式工厂结构 make_binary_f {Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {返回 {lhs, op, rhs};}};boost::phoenix::functionmake_binary;///////////////////////////////////自动补全不完整的表达式结构暗示_f {完成::提示&_提示;void operator()(boost::iterator_range where, char hidden_​​char) const {自动插入 =_hints.incomplete.emplace(&*where.begin(), std::string(1, hidden_​​char));//将隐含字符添加到现有完成如果(!插入.第二)插入的.第一个->第二个+=隐含的字符;}};boost::phoenix::function暗示 { 暗示_f { _hints } };///////////////////////////////////建议为结构建议{完成::提示&_提示;void operator()(boost::iterator_range where, Domain const& symbols) const {使用命名空间完成;源 id(&*where.begin(), where.size());候选人 c;Symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });自动评分 = [id](Source v) { return Fuzzy_match(id, v);};auto byscore = [=](Source a, Source b) { return score(a) >得分(b);};sort(c.begin(), c.end(), byscore);c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) <3; }), c.end());_hints.suggestions.emplace(id, c);}};boost::phoenix::function;建议_为{建议{_提示}};};}#include #include int main() {对于(源常量输入:{,      //无效的"(3",//不完整,暗示 ')'"3*(6+sqrt(9))^7 - 1e8",//完全有效"(3*(((6+sqrt(9))^7 - 1e8",//不完整,暗示 ")))""print("hello \"world!",//完成字符串字面量和参数列表"foo",//好的,已知变量"baz",//(建议栏)"baz(",//不完整,隐含 ')',未知函数"taz(",//不完整,隐含 ')',未知函数"san(",//2 个建议 (sin/tan)"打印(1, 2, "三", 复杂的(san(78","(print sqrt sin 9) -0) "bye",}){std::cout <<-------------- '" <<输入<<"'
";位置 f = input.begin(), l = input.end();Ast::Expression expr;完成::提示提示;bool ok = parse(f, l, Parsing::Expression{hints}, expr);如果(提示){std::cout <<输入:'"<<输入<<"'
";}for (auto& c : hints.incomplete) {std::cout <<失踪" <<std::setw(c.first - input.begin()) <<"<<"^ 暗示:'" <<c.第二个<<"'
";}for (auto& id :hints.suggestions) {std::cout <<未知"<<std::setw(id.first.begin() - input.begin()) <<"<

<小时>

¹ 提升精神船长问题

I am using Boost::Spirit to build simple "data-filter" language in my C++ GUI application for non-technical users. Language is very similiar to plain English and parse-able into AST. I am requested to make the process as user-friendly as possible, so I wish to provide CLang-like error messages ("Unrecognized 'tokken', did you meant 'token'?") and autocompletion.

The question in place is how to use Boost::Spirit grammar to generate possible token list for both goals (I will use simple string distance algorithm to satisfy first requirement)?

My ideas so far:

  • Add position and length in source stream to token definition obtained from parser.
  • Add special invalid token to grammar, which will match anything... well... not valid :-)
  • If user presses ctrl + space, build AST (with invalid token the tree will be always buildable), then look for token inside current cursor position
  • Use simple matcher on all possible tokens (I have token list after all) and prepare a list of suggestions

The problem with this solution is the suggestion will also suggest tokens that are invalid for given place. And if I add (and I will) definable identifiers, I have much bigger problem in hand...

One more constraint: I want to have grammar for this language defined in only one place; If the grammar changes, I want to autocompleter be aware of this changes after recompilation

解决方案

Out of curiosity, I decided to try the concept.

Here's my attempt.

Plan

Let's parse arithmetic expressions with function calls.

Now, we want to parse (partial) expression with possible unknown identifiers.

In case of incomplete expressions, we want to "imply" the minimal token to complete it (and potentially continue parsing).

In case of unknown identifiers, we want to fuzzy-match the known identifiers in the domain (either variables or functions) and rank them in order of decreasing probability.


Base Definitions

Let's start out by deciding we want our input to be in memory, so we can refer to locations/substrings by using string_views:

#include <boost/utility/string_view.hpp>

using Source = boost::string_view;
using Location = Source::const_iterator;

Completion Hints

Besides the AST, we want our parser to generate completion hints (the implicit completion tokens and suggestions)

namespace Completion {

    using Candidates = std::vector<std::string>;

    class Hints {
        struct ByLocation {
            template <typename T, typename U>
            bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
          private:
            static Location loc(Source const& s) { return s.begin(); }
            static Location loc(Location const& l) { return l; }
        };

      public:
        std::map<Location, std::string, ByLocation> incomplete;
        std::map<Source, Candidates, ByLocation> suggestions;

        /*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
    };

In addition, let's code up a quick & dirty fuzzy identifier match scoring function.

I opted for a simple synchronizing compare that

  • scores corresponding runs of characters progressively, and
  • favours skipping characters from candidates over skipping characters from the input (meaning that adj_diff is a good fuzzy match for adjacent_difference even though characters have been skipped from the candidate, but adj_qqq_diff is worse because the qqq from the input could not be matched)
  • the algorithm is done recursively and without allocations
  • first characters get a boost if matched (rate=1 at first invocation)

    static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
        int score = 0;

        while (!(input.empty() || candidate.empty())) {
            if (input.front() != candidate.front()) {
                return score + std::max(
                    fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
                    fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
            }

            input.remove_prefix(1);
            candidate.remove_prefix(1);
            score += ++rate;
        }
        return score;
    }

} // namespace Completion

We'll see how this is used in the grammar.

AST

A run-of-the-mill AST that can do binary expressions, string/numeric literals, variables and (nested) function calls:

#include <boost/variant.hpp>

namespace Ast {
    using NumLiteral = double;
    using StringLiteral = std::string; // de-escaped, not source view

    struct Identifier : std::string {
        using std::string::string;
        using std::string::operator=;
    };

    struct BinaryExpression;
    struct CallExpression;

    using Expression = boost::variant<
            NumLiteral,
            StringLiteral,
            Identifier,
            boost::recursive_wrapper<BinaryExpression>,
            boost::recursive_wrapper<CallExpression>
        >;

    struct BinaryExpression {
        Expression lhs;
        char op;
        Expression rhs;
    };

    using ArgList = std::vector<Expression>;

    struct CallExpression {
        Identifier function;
        ArgList args;
    };
}

Grammar


Basics

The grammar starts off pretty "basic" as well:

namespace Parsing {
    namespace qi  = boost::spirit::qi;
    namespace phx = boost::phoenix;

    template <typename It>
    struct Expression : qi::grammar<It, Ast::Expression()> {
        Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
            using namespace qi;

            start      = skip(space) [expression];

            expression = term   [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
            term       = factor [_val = _1] >> *(char_("*/") >> term)       [_val = make_binary(_val, _1, _2)];
            factor     = simple [_val = _1] >> *(char_("^")  >> factor)     [_val = make_binary(_val, _1, _2)];

            simple     = call | variable | compound | number | string;

So far so good: The constructor stores a reference to the Completion::Hints& to be recorded. All these rules have been declared as qi::rule<It, Expression(), qi::space_type>.

Implied Tokens

Now it gets slightly more interesting, some rules here are lexemes¹

            number     = double_;
            identifier = raw[(alpha|'_') >> *(alnum|'_')];

And some productions are tolerant of missing closing tokens:

            // imply the closing quotes
            string   %= '"' >> *('\' >> char_ | ~char_('"')) >> implied('"');
            compound %= '(' >> expression >> implied(')');

The implied magic makes it so that if the expected closing token is missing, it will be recorded as a hint, and parsing continues:

            auto implied = [=](char ch) {
                return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
            };

Of course, this begs the question what imply(_1, ch) really does, and we'll see later.

For now, observe that we do raw[eps] to get an (empty) source iterator_range to construct a Location from in the semantic action.

Identifier Lookup

We store "symbol tables" in Domain members _variables and _functions:

        using Domain = qi::symbols<char>;
        Domain _variables, _functions;

Then we declare some rules that can do lookups on either of them:

        // domain identifier lookups
        qi::_r1_type _domain;
        qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

The corresponding declarations will be shown shortly.

Variables are pretty simple:

        variable   = maybe_known(phx::ref(_variables));

Calls are trickier. If a name is unknown we don't want to assume it implies a function unless it's followed by a '(' character. However, if an identifier is a known function name, we want even to imply the ( (this gives the UX the appearance of autocompletion where when the user types sqrt, it suggests the next character to be ( magically).

        // The heuristics:
        // - an unknown identifier followed by (
        // - an unclosed argument list implies )
        call %= ( known(phx::ref(_functions)) // known -> imply the parens
                     | &(identifier >> '(') >> unknown(phx::ref(_functions))
                     ) >> implied('(') >> -(expression % ',') >> implied(')');

It all builds on known, unknown and maybe_known:

            ///////////////////////////////
            // identifier loopkup, suggesting
            {
                maybe_known = known(_domain) | unknown(_domain);

                // distinct to avoid partially-matching identifiers
                using boost::spirit::repository::qi::distinct;
                auto kw     = distinct(copy(alnum | '_'));

                known       = raw[kw[lazy(_domain)]];
                unknown     = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
            }

Debug, Predefined Variables/Functions

Remaining is a bit of red tape:

            BOOST_SPIRIT_DEBUG_NODES(
                (start)
                (expression)(term)(factor)(simple)(compound)
                (call)(variable)
                (identifier)(number)(string)
                //(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
            )

            _variables += "foo", "bar", "qux";
            _functions += "print", "sin", "tan", "sqrt", "frobnicate";
        }

      private:
        Completion::Hints& _hints;

        using Domain = qi::symbols<char>;
        Domain _variables, _functions;

        qi::rule<It, Ast::Expression()> start;
        qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
        // completables
        qi::rule<It, Ast::Expression(), qi::space_type> compound;
        qi::rule<It, Ast::CallExpression(), qi::space_type> call;

        // implicit lexemes
        qi::rule<It, Ast::Identifier()> variable, identifier;
        qi::rule<It, Ast::NumLiteral()> number;
        qi::rule<It, Ast::StringLiteral()> string;

        // domain identifier lookups
        qi::_r1_type _domain;
        qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

Phoenix Helpers

Let's start with the usual helper to construct binary AST nodes:

        ///////////////////////////////
        // binary expression factory
        struct make_binary_f {
            Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
                return {lhs, op, rhs};
            }
        };
        boost::phoenix::function<make_binary_f> make_binary;

Similarly, we can have a Phoenix function object to register implied chars:

        ///////////////////////////////
        // auto-completing incomplete expressions
        struct imply_f {
            Completion::Hints& _hints;
            void operator()(boost::iterator_range<It> where, char implied_char) const {
                auto inserted = 
                    _hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
                // add the implied char to existing completion
                if (!inserted.second)
                    inserted.first->second += implied_char;
            }
        };
        boost::phoenix::function<imply_f> imply { imply_f { _hints } };

Note that it orders by location (which makes UX easier) and detects when a previous implied token lived at the same location, in which case we simply append to it.

By now it won't be much of a surprise that generating relevant suggestions for unknown identifiers is also a phoenix actor:

        ///////////////////////////////
        // suggest_for
        struct suggester {
            Completion::Hints& _hints;

            void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
                using namespace Completion;
                Source id(&*where.begin(), where.size());
                Candidates c;

                symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });

                auto score = [id](Source v) { return fuzzy_match(id, v); };
                auto byscore = [=](Source a, Source b) { return score(a) > score(b); };

                sort(c.begin(), c.end(), byscore);
                c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());

                _hints.suggestions.emplace(id, c);
            }
        };
        boost::phoenix::function<suggester> suggest_for {suggester{_hints}};

That's all. If it looks more complicated, that's because it does a lot more: it scores all candidates fuzzily, orders them by descending score and removes any candidates with score < 3.

    };
}

BONUS

Let's alter things a little more and allow ',' to be implied inside argument lists:

        call %= ( known(phx::ref(_functions)) // known -> imply the parens
                | &(identifier >> '(') >> unknown(phx::ref(_functions))
                ) 
            >> implied('(') 
            >> -(expression % ',')
            >> implied(')')
            ;

Let's replace ',' there:

            >> -(expression % (',' | !(')'|eoi) >> implied(',')))

NOTE: It might seem to make more sense to detect &expression to assert that an expression follows, instead asserting that the end of the input/argument list has not been reached.

Doing that goes bad, though, because any contained expressions could trigger more implied tokens as a side effect, leading to duplicated implied tokens.

This (side-effectful semantic actions) is one of the chief reasons why I usually avoid semantic actions, or use them to store effect only in the rule's (exposed) attribute(s). See Boost Spirit: "Semantic actions are evil"?

TEST DRIVER

This post would be nothing without some nice test cases. So here goes:

int main() {
    for (Source const input : {
            "",       // invalid
            "(3",     // incomplete, imply ')'
            "3*(6+sqrt(9))^7 - 1e8", // completely valid
            "(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
            "print("hello \"world!", // completes the string literal and the parameter list
            "foo",    // okay, known variable
            "baz",    // (suggest bar)
            "baz(",   // incomplete, imply ')', unknown function
            "taz(",   // incomplete, imply ')', unknown function
            "san(",   // 2 suggestions (sin/tan)
            "print(1, 2, "three", complicated(san(78",
            "(print sqrt sin 9)    -0) "bye",
        })
    {
        std::cout << "-------------- '" << input << "'
";
        Location f = input.begin(), l = input.end();

        Ast::Expression expr;
        Completion::Hints hints;
        bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);

        if (hints) {
            std::cout << "Input: '" << input << "'
";
        }
        for (auto& c : hints.incomplete) {
            std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'
";
        }
        for (auto& id : hints.suggestions) {
            std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
            if (id.second.empty()) 
                std::cout << " (no suggestions)
";
            else {
                std::cout << " (did you mean ";
                once_t first;
                for (auto& s : id.second)
                    std::cout << (first?"":" or ") << "'" << s << "'";
                std::cout << "?)
";
            }
        }

        if (ok) {
            std::cout << "AST:    " << expr << "
";
        } else {
            std::cout << "Parse failed
";
        }

        if (f!=l)
            std::cout << "Remaining input: '" << std::string(f,l) << "'
";
    }
}

Note that, besides the first input ("") everything is heuristically parsed as some kind of expression! The last one is designed to show off how all the parameter lists are implied because print, sqrt and sin are known functions. Then some ',' are implied, before finally the unclosed string literal and remaining parentheses are closed. Here's the (non-debug) output:

-------------- ''
Parse failed
-------------- '(3'
Input: '(3'
Missing   ^ implied: ')'
AST:    3
-------------- '3*(6+sqrt(9))^7 - 1e8'
AST:    ((3 * ((6 + (sqrt (9))) ^ 7)) - 1e+08)
-------------- '(3*(((6+sqrt(9))^7 - 1e8'
Input: '(3*(((6+sqrt(9))^7 - 1e8'
Missing                         ^ implied: ')))'
AST:    (3 * (((6 + (sqrt (9))) ^ 7) - 1e+08))
-------------- 'print("hello "world!'
Input: 'print("hello "world!'
Missing                      ^ implied: '")'
AST:    (print (hello "world!))
-------------- 'foo'
AST:    foo
-------------- 'baz'
Input: 'baz'
Unknown ^^^ (did you mean 'bar'?)
AST:    baz
-------------- 'baz('
Input: 'baz('
Missing     ^ implied: ')'
Unknown ^^^ (no suggestions)
AST:    (baz ())
-------------- 'taz('
Input: 'taz('
Missing     ^ implied: ')'
Unknown ^^^ (did you mean 'tan'?)
AST:    (taz ())
-------------- 'san('
Input: 'san('
Missing     ^ implied: ')'
Unknown ^^^ (did you mean 'sin' or 'tan'?)
AST:    (san ())
-------------- 'print(1, 2, "three", complicated(san(78'
Input: 'print(1, 2, "three", complicated(san(78'
Missing                                        ^ implied: ')))'
Unknown                      ^^^^^^^^^^^ (did you mean 'frobnicate' or 'print'?)
Unknown                                  ^^^ (did you mean 'sin' or 'tan'?)
AST:    (print (1, 2, three, (complicated ((san (78))))))
-------------- '(print sqrt sin 9)    -0) "bye'
Input: '(print sqrt sin 9)    -0) "bye'
Missing        ^ implied: '('
Missing             ^ implied: '('
Missing                 ^ implied: '('
Missing                           ^ implied: ','
Missing                               ^ implied: '"))'
AST:    (print ((sqrt (((sin (9)) - 0))), bye))

Full Listing / Live Demo

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/utility/string_view.hpp>

using Source = boost::string_view;
using Location = Source::const_iterator;

#include <map>
#include <vector>

namespace Completion {

    static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
        int score = 0;

        while (!(input.empty() || candidate.empty())) {
            if (input.front() != candidate.front()) {
                return score + std::max(
                    fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
                    fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
            }

            input.remove_prefix(1);
            candidate.remove_prefix(1);
            score += ++rate;
        }
        return score;
    }

    using Candidates = std::vector<std::string>;

    class Hints {
        struct ByLocation {
            template <typename T, typename U>
            bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
          private:
            static Location loc(Source const& s) { return s.begin(); }
            static Location loc(Location const& l) { return l; }
        };

      public:
        std::map<Location, std::string, ByLocation> incomplete;
        std::map<Source, Candidates, ByLocation> suggestions;

        /*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
    };
}

#include <boost/variant.hpp>

namespace Ast {
    using NumLiteral = double;
    using StringLiteral = std::string; // de-escaped, not source view

    struct Identifier : std::string {
        using std::string::string;
        using std::string::operator=;
    };

    struct BinaryExpression;
    struct CallExpression;

    using Expression = boost::variant<
            NumLiteral,
            StringLiteral,
            Identifier,
            boost::recursive_wrapper<BinaryExpression>,
            boost::recursive_wrapper<CallExpression>
        >;

    struct BinaryExpression {
        Expression lhs;
        char op;
        Expression rhs;
    };

    using ArgList = std::vector<Expression>;

    struct CallExpression {
        Identifier function;
        ArgList args;
    };
}

#include <boost/fusion/adapted/struct.hpp>
BOOST_FUSION_ADAPT_STRUCT(Ast::BinaryExpression, lhs, op, rhs)
BOOST_FUSION_ADAPT_STRUCT(Ast::CallExpression, function, args)

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/repository/include/qi_distinct.hpp>

// for debug printing:
namespace {
    struct once_t { // an auto-reset flag
        operator bool() { bool v = flag; flag = false; return v; }
        bool flag = true;
    };
}

// for debug printing:
namespace Ast {

    static inline std::ostream& operator<<(std::ostream& os, std::vector<Expression> const& args) {
        os << "(";
        once_t first;
        for (auto& a : args) os << (first?"":", ") << a;
        return os << ")";
    }

    static inline std::ostream& operator<<(std::ostream& os, BinaryExpression const& e) {
        return os << boost::fusion::as_vector(e);
    }
    static inline std::ostream& operator<<(std::ostream& os, CallExpression const& e) {
        return os << boost::fusion::as_vector(e);
    }
}

namespace Parsing {
    namespace qi  = boost::spirit::qi;
    namespace phx = boost::phoenix;

    template <typename It>
    struct Expression : qi::grammar<It, Ast::Expression()> {
        Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
            using namespace qi;

            start      = skip(space) [expression];

            expression = term   [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
            term       = factor [_val = _1] >> *(char_("*/") >> term)       [_val = make_binary(_val, _1, _2)];
            factor     = simple [_val = _1] >> *(char_("^")  >> factor)     [_val = make_binary(_val, _1, _2)];

            simple     = call | variable | compound | number | string;

            auto implied = [=](char ch) {
                return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
            };

            variable   = maybe_known(phx::ref(_variables));

            compound  %= '(' >> expression >> implied(')');

            // The heuristics:
            // - an unknown identifier followed by (
            // - an unclosed argument list implies )
            call %= ( known(phx::ref(_functions)) // known -> imply the parens
                    | &(identifier >> '(') >> unknown(phx::ref(_functions))
                    ) 
                >> implied('(') 
                >> -(expression % (',' | !(')'|eoi) >> implied(',')))
                >> implied(')')
                ;

            // lexemes, primitive rules
            identifier  = raw[(alpha|'_') >> *(alnum|'_')];

            // imply the closing quotes
            string     %= '"' >> *('\' >> char_ | ~char_('"')) >> implied('"'); // TODO more escapes

            number      = double_; // TODO integral arguments

            ///////////////////////////////
            // identifier loopkup, suggesting
            {
                maybe_known = known(_domain) | unknown(_domain);

                // distinct to avoid partially-matching identifiers
                using boost::spirit::repository::qi::distinct;
                auto kw     = distinct(copy(alnum | '_'));

                known       = raw[kw[lazy(_domain)]];
                unknown     = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
            }

            BOOST_SPIRIT_DEBUG_NODES(
                (start)
                (expression)(term)(factor)(simple)(compound)
                (call)(variable)
                (identifier)(number)(string)
                //(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
            )

            _variables += "foo", "bar", "qux";
            _functions += "print", "sin", "tan", "sqrt", "frobnicate";
        }

      private:
        Completion::Hints& _hints;

        using Domain = qi::symbols<char>;
        Domain _variables, _functions;

        qi::rule<It, Ast::Expression()> start;
        qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
        // completables
        qi::rule<It, Ast::Expression(), qi::space_type> compound;
        qi::rule<It, Ast::CallExpression(), qi::space_type> call;

        // implicit lexemes
        qi::rule<It, Ast::Identifier()> variable, identifier;
        qi::rule<It, Ast::NumLiteral()> number;
        qi::rule<It, Ast::StringLiteral()> string;

        // domain identifier lookups
        qi::_r1_type _domain;
        qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;

        ///////////////////////////////
        // binary expression factory
        struct make_binary_f {
            Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
                return {lhs, op, rhs};
            }
        };
        boost::phoenix::function<make_binary_f> make_binary;

        ///////////////////////////////
        // auto-completing incomplete expressions
        struct imply_f {
            Completion::Hints& _hints;
            void operator()(boost::iterator_range<It> where, char implied_char) const {
                auto inserted = 
                    _hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
                // add the implied char to existing completion
                if (!inserted.second)
                    inserted.first->second += implied_char;
            }
        };
        boost::phoenix::function<imply_f> imply { imply_f { _hints } };

        ///////////////////////////////
        // suggest_for
        struct suggester {
            Completion::Hints& _hints;

            void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
                using namespace Completion;
                Source id(&*where.begin(), where.size());
                Candidates c;

                symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });

                auto score = [id](Source v) { return fuzzy_match(id, v); };
                auto byscore = [=](Source a, Source b) { return score(a) > score(b); };

                sort(c.begin(), c.end(), byscore);
                c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());

                _hints.suggestions.emplace(id, c);
            }
        };
        boost::phoenix::function<suggester> suggest_for {suggester{_hints}};

    };
}

#include <iostream>
#include <iomanip>

int main() {
    for (Source const input : {
            "",       // invalid
            "(3",     // incomplete, imply ')'
            "3*(6+sqrt(9))^7 - 1e8", // completely valid
            "(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
            "print("hello \"world!", // completes the string literal and the parameter list
            "foo",    // okay, known variable
            "baz",    // (suggest bar)
            "baz(",   // incomplete, imply ')', unknown function
            "taz(",   // incomplete, imply ')', unknown function
            "san(",   // 2 suggestions (sin/tan)
            "print(1, 2, "three", complicated(san(78",
            "(print sqrt sin 9)    -0) "bye",
        })
    {
        std::cout << "-------------- '" << input << "'
";
        Location f = input.begin(), l = input.end();

        Ast::Expression expr;
        Completion::Hints hints;
        bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);

        if (hints) {
            std::cout << "Input: '" << input << "'
";
        }
        for (auto& c : hints.incomplete) {
            std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'
";
        }
        for (auto& id : hints.suggestions) {
            std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
            if (id.second.empty()) 
                std::cout << " (no suggestions)
";
            else {
                std::cout << " (did you mean ";
                once_t first;
                for (auto& s : id.second)
                    std::cout << (first?"":" or ") << "'" << s << "'";
                std::cout << "?)
";
            }
        }

        if (ok) {
            std::cout << "AST:    " << expr << "
";
        } else {
            std::cout << "Parse failed
";
        }

        if (f!=l)
            std::cout << "Remaining input: '" << std::string(f,l) << "'
";
    }
}


¹ Boost spirit skipper issues

这篇关于如何为给定的 boost::spirit 语法提供自动完成建议的用户?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆