What concepts or algorithms exist for parallelizing parsers?

Question

It seems easy to parallelize parsers for large amounts of input data that is already given in a split format, e.g. a large list of individual database entries, or is easy to split by a fast preprocessing step, e.g. parsing the grammatical structure of sentences in large texts.

A bit harder seems to be parallel parsing that already requires quite some effort to locate sub-structures in a given input. Common programming language code looks like a good example. In languages like Haskell, that use layout/indentation for separating individual definitions, you could probably check the number of leading spaces of each line after you've found the start of a new definition, skip all lines until you find another definition and pass each skipped chunk to another thread for full parsing.
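The layout-based splitting described above is easy to sketch. The following is a minimal illustration in Python (my own sketch, not from the original post); it treats every non-blank line with no leading whitespace as the start of a new top-level definition, which is a simplification of Haskell's real layout rule:

```python
def split_definitions(source):
    """Split layout-based source (e.g. Haskell) into top-level chunks.

    Simplified rule: a new chunk starts at every non-blank line with
    no leading whitespace; indented and blank lines belong to the
    current chunk. Each chunk could then go to a worker thread.
    """
    chunks = []
    current = []
    for line in source.splitlines():
        if line and not line[0].isspace() and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

src = "double x = x + x\n\nquad x = y + y\n  where y = double x\n"
print(split_definitions(src))
```

The splitter itself stays sequential but is a single cheap pass; only the expensive full parse of each chunk is farmed out.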

When it comes to languages like C, JavaScript etc., that use balanced braces to define scopes, the amount of work for doing the preprocessing would be much higher. You'd need to go through the whole input, thereby counting braces, taking care of text inside string literals and so on. Even worse with languages like XML, where you also need to keep track of tag names in the opening/closing tags.
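As a rough sketch of such a preprocessing pass (my own illustration, not from the post), here is a brace-counting splitter in Python. It skips over single- and double-quoted string literals with backslash escapes, but deliberately ignores comments and the other complications a real splitter would have to handle:

```python
def split_braced_chunks(code):
    """Split C/JavaScript-like code into top-level chunks by counting
    balanced braces, ignoring braces inside string literals.

    Simplified sketch: handles "..." and '...' strings with backslash
    escapes, but not comments or template literals.
    """
    chunks = []
    depth = 0
    start = 0
    i = 0
    n = len(code)
    while i < n:
        c = code[i]
        if c in "\"'":               # skip over a string literal
            quote = c
            i += 1
            while i < n and code[i] != quote:
                if code[i] == "\\":
                    i += 1           # skip the escaped character
                i += 1
        elif c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:           # a top-level unit just closed
                chunks.append(code[start:i + 1])
                start = i + 1
        i += 1
    tail = code[start:].strip()
    if tail:
        chunks.append(code[start:])
    return chunks

parts = split_braced_chunks('int f(){return 1;} int g(){char *s = "}"; return 2;}')
print(parts)
```

Note the brace inside the string literal in `g` does not end the chunk, which is exactly the bookkeeping that makes this pass more expensive than the layout-based one.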

I found a parallel version of the CYK parsing algorithm that seems to work for all context-free grammars. But I'm curious what other general concepts/algorithms exist that make it possible to parallelize parsers, including things like the brace counting described above, which would only work for a limited set of languages. This question is not about specific implementations but about the ideas such implementations are based on.
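For reference, here is a plain sequential CYK recognizer (my own sketch, not the parallel version referred to above) for a grammar in Chomsky normal form. The comment marks the loop whose iterations are independent, which is the structure parallel CYK variants distribute across workers:

```python
def cyk_recognize(tokens, terminal_rules, binary_rules, start):
    """CYK recognizer for a grammar in Chomsky normal form.

    table[i][j] holds the nonterminals deriving tokens[i : i+j+1].
    All cells with the same span length depend only on shorter spans,
    so each iteration of the inner loop over i is independent; that
    layer-by-layer structure is what parallel CYK variants exploit.
    """
    n = len(tokens)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {a for a, t in terminal_rules if t == tok}
    for span in range(1, n):              # span length minus one
        for i in range(n - span):         # independent: parallelizable
            for split in range(span):
                left = table[i][split]
                right = table[i + split + 1][span - split - 1]
                for a, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][span].add(a)
    return start in table[0][n - 1]

# Toy grammar in CNF: S -> A B, A -> 'a', B -> 'b'
terminals = [("A", "a"), ("B", "b")]
binaries = [("S", ("A", "B"))]
print(cyk_recognize(["a", "b"], terminals, binaries, "S"))   # True
```

CYK is cubic in the input length, so even parallelized it is rarely practical for programming languages; it matters here because it parallelizes for arbitrary context-free grammars.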

Answer

I think you will find McKeeman's 1982 paper on Parallel LR Parsing quite interesting, as it appears to be practical and applies to a broad class of grammars.

The basic scheme is standard LR parsing. What is clever is that the (presumably long) input is divided into roughly N equal-sized chunks (for N processors), and each chunk is parsed separately. Because the starting point of a chunk may (must!) be in the middle of some production, McKeeman's individual parsers, unlike classic LR parsers, start with all possible left contexts (requiring that the LR state machine be augmented) in order to determine which LR items apply to the chunk. (It shouldn't take many tokens before an individual parser has determined which states really apply, so this isn't very inefficient.) Then the results of all the parsers are stitched together.
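The key idea, starting a chunk parser in every possible left context and letting the token stream narrow the candidates, can be illustrated with a toy state machine. This is my own illustration with a made-up transition table, not McKeeman's actual construction, and it omits the stack and reductions of real LR parsing:

```python
# Toy illustration of "start in all states": a chunk parser does not
# know its left context, so it begins with every state as a candidate
# and discards states whose transition table has no move on the next
# token. The table below is hypothetical, purely for illustration.

TRANSITIONS = {  # state -> {token: next_state}
    0: {"id": 1, "(": 2},
    1: {"+": 0, ")": 3},
    2: {"id": 1},
    3: {"+": 0, ")": 3},
}

def narrow_candidates(tokens):
    """Return the candidate-state sets before and after each token."""
    candidates = set(TRANSITIONS)          # all possible left contexts
    history = [sorted(candidates)]
    for tok in tokens:
        candidates = {TRANSITIONS[s][tok]
                      for s in candidates if tok in TRANSITIONS[s]}
        history.append(sorted(candidates))
    return history

# After the first token the candidate set already collapses:
print(narrow_candidates(["id", ")", "+"]))
```

In the real scheme each candidate carries a full parse attempt, but the narrowing behaviour is the same: a few tokens usually suffice to disambiguate, which is why the redundant start costs little.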

He sort of ducks the problem of partitioning the input in the middle of a token. (You can imagine an arbitrarily big string literal containing text that looks like code, fooling a parser that starts in the middle of it.) What appears to happen is that such a parser runs into an error and abandons its parse; the parser to its left takes up the slack. One can imagine the chunk splitter using a little bit of smarts to mostly avoid this.

He goes on to demonstrate a real parser for which speedups are obtained.

Really clever.
