Extracting from .bib file with Raku (previously aka Perl 6)


Problem description

I have this .bib file for reference management while writing my thesis in LaTeX:

@article{garg2017patch,
  title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
  author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
  journal={Journal of Cosmetic Dermatology},
  year={2017},
  publisher={Wiley Online Library}
}

@article{hauso2008neuroendocrine,
  title={Neuroendocrine tumor epidemiology},
  author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
  journal={Cancer},
  volume={113},
  number={10},
  pages={2655--2664},
  year={2008},
  publisher={Wiley Online Library}
}

@article{siperstein1997laparoscopic,
  title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
  author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
  journal={Surgery},
  volume={122},
  number={6},
  pages={1147--1155},
  year={1997},
  publisher={Elsevier}
}

If anyone wants to know what a .bib file is, you can find it detailed here.

I'd like to parse this with Perl 6 to extract the key along with the title like this:

garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study

hauso2008neuroendocrine: Neuroendocrine tumor epidemiology

siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases

Can you please help me to do this, maybe in two ways:

  1. Using basic Perl 6
  2. Using a Perl 6 Grammar

Solution

This answer is aimed at being both:

  • An introductory general answer to "I want to parse X with Perl 6. Can anyone help?"

  • A complete and detailed answer that does exactly as @Suman asks.


In a single statement (power user)

"$_[0]: $_[1]\n" .put
  for (slurp 'derm.bib')
    ~~ m:g/ '@article{' (<-[,]>+) ',' \s+ 'title={' ~ '}' (<-[}]>+) /

(Run this code at glot.io.)

I decided to start with the sort of thing a dev familiar with P6 would write in a few minutes to do just the simple task you've specified in your question if they didn't much care about readability for newbies.

I'm not going to provide an explanation of it. It just does the job. If you're a P6 newbie it could well be overwhelming. If so, please read the rest of my answer -- it takes things slower and has comprehensive commentary. Perhaps return here and see if it makes more sense after reading the rest.

A "basic Perl 6" solution

my \input      = slurp 'derm.bib' ;

my \pattern    = rule { '@article{'       ( <-[,]>+ ) ','
                          'title={' ~ '}' ( <-[}]>+ ) }

my \articles   = input.match: pattern, :global ;

for articles -> $/ { print "$0: $1\n\n" }

This is almost identical to the "single statement (power user)" code -- broken into four statements rather than one. I could have made it more closely copy the first version of the code but have instead made a few changes that I'll explain. I've done this to make it clearer that P6 deliberately makes its features a scalable and refactorable continuum, so one can mix and, er, match whichever features best fit a given use case.

my \input      = slurp 'derm.bib' ;

Perls are famous for their sigils. In P6, if you don't need them you can "slash" them out. Perls are also famous for having terse ways of doing things. slurp reads a file in its entirety in one go.

my \pattern    = rule { '@article{'       ( <-[,]>+ ) ','
                          'title={' ~ '}' ( <-[}]>+ ) }

Perl 6 patterns are generically called regexes or Rules. There are several types of regexes/rules. The pattern language is the same; the distinct types just direct the matching engine to modify how it handles a given pattern.

One regex/rule type is the P6 equivalent of classic regexes. These are declared with either /.../ or regex {...}. The regex in the opening "power user" code was one of these regexes. Their distinction is that they backtrack when necessary, just like classic regexes.

There's no need for backtracking to match the .bib format. Unless you need backtracking, it's wise to consider using one of the other rule types instead. I've switched to a rule declared with the keyword rule.

A rule declared with rule is identical to one declared with regex (or /.../) except that A) it doesn't backtrack and B) it interprets spaces in its pattern as corresponding to possible spaces in the input. Did you spot that I'd dropped the \s+ from the pattern immediately before 'title={'? That's because a rule takes care of that automatically.
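To see the whitespace difference in isolation, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the regex and rule declarators come from the answer above.

```raku
# Hypothetical sample: a key, a comma, then a newline and indentation before the next field.
my $input = "garg2017patch,\n  title";

# In a regex, whitespace between atoms in the pattern is ignored,
# so the two literals must sit right next to each other in the input.
my $as-regex = regex { 'garg2017patch,' 'title' };

# In a rule, whitespace after an atom in the pattern allows whitespace in the input.
my $as-rule  = rule  { 'garg2017patch,' 'title' };

my $regex-hit = so $input.match($as-regex);
my $rule-hit  = so $input.match($as-rule);

say "regex matched: $regex-hit";
say "rule matched:  $rule-hit";
```

The regex version should fail here because of the newline and indentation after the comma, while the rule version should match.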

The other difference is that I wrote:

'title={' ~ '}' ( ... )

instead of:

'title={' ( ... ) '}'

i.e. moving the pattern matching the bit between the braces after the braces and putting a ~ in between the braces instead. They match the same overall pattern. I could have written things either way in the power user /.../ pattern and either way in this section's rule pattern. But I wanted this section to be a bit more "best practice" oriented. I'll defer a full explanation of this difference and all the other details of this pattern until the Explanation of 'bib' grammar section below.

my \articles   = input.match: pattern, :global ;

This line uses the method form of the m routine used in the earlier "power user" version.

:global is the same as :g. I could have written it either way in both versions.

Add :global (or :g) to the argument list when invoking the .match method (or m routine) if you want to search the entire string being matched, finding as many matches as there are, not just the first. The method (or m routine) then returns a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
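A small sketch of the difference, using a cut-down, made-up input that contains only the three entry headers (the variable names are mine, not part of the original solution):

```raku
# Hypothetical cut-down input: just the opening line of each entry.
my $bib = q:to/END/;
    @article{garg2017patch,
    @article{hauso2008neuroendocrine,
    @article{siperstein1997laparoscopic,
    END

my $pattern = / '@article{' (<-[,]>+) /;

my $first = $bib.match: $pattern;            # a single Match object
my @all   = $bib.match: $pattern, :global;   # a list of Match objects

say $first[0].Str;
say @all.elems;
say @all.map(*[0].Str).join(', ');
```

Without :global only the first key is found; with it, @all holds one Match per entry.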

for articles -> $/ { print "$0: $1\n\n" }

Per P6 doc on $/, "$/ is the match variable ... so usually contains objects of type Match.". It also provides some other conveniences and we take advantage of one of these conveniences here as explained next.

The for loop successively binds each of the overall Match objects (corresponding to each of the articles in your sample file that were successfully matched by the pattern) to the symbol $/ inside the for's block.

The pattern contains two pairs of parentheses. These generate "Positional captures". The overall Match object provides access to its two Positional captures via Positional subscripting (postfix []). Thus, within the for block, $/[0] and $/[1] provide access to the two Positional captures for a given article. But so do $0 and $1 -- because standard P6 aliases these latter symbols to $/[0] and $/[1] for your convenience.
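The aliasing can be checked directly. This is only an illustrative sketch; the sample line and variable names are assumptions of mine:

```raku
my $line = 'hauso2008neuroendocrine: Neuroendocrine tumor epidemiology';

# Two pairs of parentheses, so two Positional captures.
my $matched = so $line ~~ / (\w+) ':' \s* (.+) /;

my $via-subscript = $/[0].Str;   # explicit Positional subscript on $/
my $via-alias     = $0.Str;      # $0 is shorthand for $/[0]

say $via-subscript;
say $via-alias;
say $1.Str;                      # $1 is shorthand for $/[1]
```

Both of the first two lines print the key, because $0 and $/[0] refer to the same capture.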


Still with me?

The latter half of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above.

But first...

A "boring" practical answer

I want to parse this with Perl 6. Can anyone help?

P6 may make writing parsers less tedious than with other tools. But less tedious is still tedious. And P6 parsing is currently slow.

In most cases, the practical answer when you want to parse anything beyond the most trivial of file formats -- especially a well known format that's several decades old -- is to find and use an existing parser.

You might start with a search for 'bib' on modules.perl6.org in the hope of finding a publicly shared 'bib' parsing module. Either a pure Perl 6 one or some P6 wrapper around a non-P6 library. But at the time of writing this there are no matches for 'bib'.

There's almost certainly a 'bib' parsing C library already available. And it's likely to be the fastest solution. It's also likely that you can easily and elegantly use an external parsing library packaged as a C lib, in your own P6 code, even if you don't know C. If NativeCall is either too much or too little explanation, consider visiting the freenode IRC channel #perl6 and asking for whatever NativeCall help you need or want.

If a C lib isn't right for a particular use case then you can probably still use packages written in Perl 5, Python, Ruby, Lua, etc. via their Inline::* language adapters. Just install the Perl 5, Python or whatever package that you want; make sure it runs using that other language; install the appropriate language adapter; then use the package and its features as if it were a P6 package containing exported P6 functions, classes, objects, values, etc.

The Perl 5 adapter is the most mature so I'll use that as an example. Let's say you use Perl 5's Text::BibTeX packages and now wish to use Perl 6 with the existing Text::BibTeX::BibFormat module from Perl 5. First, set up the Perl 5 packages as they are supposed to be per their READMEs etc. Then, in Perl 6, write something like:

use Text::BibTeX::BibFormat:from<Perl5>;
...
@blocks = $entry.format;

The first line is how you tell P6 that you wish to load a P5 module. (It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Rakudo Perl 6 bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)

The last line is just a mechanical P6 translation of the @blocks = $entry->format; line from the SYNOPSIS of the Perl 5 Text::BibTeX::BibFormat.

Creating a P6 grammar / parser

OK. Enough "boring" practical advice. Let's now try to have some fun creating a P6 parser good enough for the example from your question.

# use Grammar::Tracer;

grammar bib {

    rule TOP           { <article>* }

    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }

    rule kv-pairs      { <kv-pair>* % ',' }

    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

}

With this grammar in place, we can now write something like:

die "Maybe use Grammar::Tracer?" unless bib.parsefile: 'derm.bib';

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

to generate exactly the same output as with the earlier "power user" and "basic Perl 6" solutions -- but using a grammar / parser approach.

Explanation of 'bib' grammar

# use Grammar::Tracer;

If a parse fails, the return value is Nil. P6 won't tell you how far it got. You'll have zero clue why your parse failed.

If you don't have a better option (?), then, when your grammar fails, use Grammar::Tracer to help debug (installing it first if you don't already have it installed).
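A minimal sketch of what a failed parse looks like from the calling code. The KeyOnly grammar is a made-up toy, not the bib grammar from this answer:

```raku
# use Grammar::Tracer;   # uncomment (after installing it) to trace which rules were tried

# Hypothetical toy grammar that only understands '@article{KEY}'.
grammar KeyOnly {
    token TOP { '@article{' <id> '}' }
    token id  { <-[,}]>+ }
}

my $good = KeyOnly.parse('@article{garg2017patch}');
my $bad  = KeyOnly.parse('@book{garg2017patch}');

say $good.defined;           # a Match object came back
say $bad.defined;            # nothing usable came back
say $good<id>.Str if $good;
```

The failed parse gives you nothing to inspect, which is why a tracer is handy when a grammar refuses to match.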

grammar bib {

The grammar keyword is like class, but a grammar can contain not just named methods as usual but also named regexes, tokens, and rules.

    rule TOP           {

Unless you specify otherwise, parsing routines start out by calling the rule (or token, regex, or method) named TOP.

As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't backtrack so they eliminate the risk of unnecessarily running slowly due to backtracking.)

But in this case I've used a rule. Like token patterns, rules also avoid backtracking. But in addition they take whitespace following any atom in the pattern to be significant in a natural manner. This is typically appropriate towards the top of the parse tree. (Tokens, and the occasional regex, are typically appropriate towards the leaves.)
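Here is a small sketch of the backtracking difference, using a deliberately artificial pattern and input (both are my own illustration):

```raku
my $text = 'aaab';

# 'a'+ first grabs all three a's. For the following 'ab' to match,
# the engine has to give one 'a' back, i.e. backtrack.
my $backtracking = regex { ^ 'a'+ 'ab' $ };
my $ratcheting   = token { ^ 'a'+ 'ab' $ };

my $regex-ok = so $text.match($backtracking);
my $token-ok = so $text.match($ratcheting);

say "regex: $regex-ok";
say "token: $token-ok";
```

The regex succeeds because it is allowed to retry with a shorter 'a'+ run. The token does not retry, so it fails, which is the price paid for predictable speed.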

    rule TOP           { <article>* }

The space at the end of the rule means the grammar will match any amount of whitespace at the end of the input.

<article> invokes another named rule (or token/regex/method) in this grammar.

Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.

    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }

I sometimes lay rules out to resemble the way typical input looks. I tried to do so here.

<[...]> is the P6 syntax for a character class, like [...] in traditional regex syntax. It's more powerful but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in ye olde [^,] syntax. So <-[,]>+ attempts a match of one or more characters, none of which are ,.

$<id>=<-[,]>+ tells P6 to attempt to match the quantified atom on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key 'id' within the current Match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.
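Outside a grammar the same named capture can be tried on its own. A minimal sketch, with a made-up header line:

```raku
my $header = '@article{garg2017patch,';

# $<id>= stores whatever the quantified atom <-[,]>+ matched under the key 'id'.
my $matched = so $header ~~ / '@article{' $<id>=<-[,]>+ ',' /;

say $<id>.Str;     # shorthand form
say $/<id>.Str;    # the same capture, written out in full
```

Both lines should print the entry key without the trailing comma.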

    rule kv-pairs      { <kv-pair>* % ',' }

This regex code illustrates one of several convenient P6 regex features. It says you want to match zero or more kv-pairs separated by commas.

(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
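A quick sketch of % on its own, using a made-up comma-separated string rather than real kv-pairs:

```raku
my $list = '113,10,2655';

# Zero or more runs of digits, separated by commas, and nothing else.
my $match   = $list.match: / ^ (\d+)* % ',' $ /;
my @numbers = $match[0].map(*.Str);

say @numbers.join(' | ');

# A trailing comma is not "between" two items, so this input is rejected.
my $trailing = so '113,10,'.match(/ ^ (\d+)* % ',' $ /);
say $trailing;
```

Because the capture group is quantified, $match[0] holds a list with one Match per item.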

    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be much clearer.
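To check that the two spellings really match the same text, here is a minimal sketch with a single made-up field:

```raku
my $field = 'title={Neuroendocrine tumor epidemiology}';

# Closer written after the content, as in classic regexes.
my $plain = $field.match: / 'title={' (<-[}]>+) '}' /;

# Opener ~ closer first, then the content that goes between them.
my $tilde = $field.match: / 'title={' ~ '}' (<-[}]>+) /;

say $plain[0].Str;
say $tilde[0].Str;
say $plain.Str eq $tilde.Str;
```

Both patterns capture the same title and cover the same overall span; the tilde form just states the delimiters up front.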

An explanation of the parse tree's construction/deconstruction

The $<article> and .<id> etc. bits in the last line (for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }) refer to Match objects that are stored in the parse tree that's generated and returned from a successful parse.

Returning to the top of the grammar:

    rule TOP           {

If a parse is successful, a single 'TOP' level Match object, the one corresponding to the top of the parse tree, is returned. (It's also made available to code immediately following the parse method call via the variable $/.)

But before that final return from parsing happens, many other Match objects, representing sub parts of the overall parse, will have been generated and added to the parse tree. Addition of Match objects to a parse tree is done by assigning either a single generated Match object, or a list of them, to either a Positional or Associative element of a "parent" Match object, as explained next.

    rule TOP           { <article>* }

A rule invocation like <article> has two effects. First, P6 tries to match the rule. Second, if it matches, P6 generates a corresponding Match object and adds it to the parse tree.

If the successfully matched pattern had been just <article>, rather than <article>*, then only one match would have been attempted and only one value, a single Match object, would have been generated and added to the parse tree.

But the pattern was <article>*, not merely <article>. So P6 attempts to match the article rule multiple times. If it matches at least once then it generates and stores a corresponding list of one or more Match objects. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)

So a list of Match objects is assigned to the 'article' key of the TOP level Match object. (If the matching regex expression had been just <article> rather than <article>* then a match would result in just a single Match object being assigned to the 'article' key rather than a list of them.)

So now I'll try to explain the $<article> part of the last line of code, which was:

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

$<article> is short for $/.<article>.

Per P6 doc on $/, "$/ is the match variable. It stores the result of the last Regex match and so usually contains objects of type Match.".

The last Regex match in our case was the TOP rule from the bib grammar.

So $<article> is the value under the 'article' key of the TOP level Match object returned by the parse. This value is a list of 3 'article' level Match objects.

    rule article       { '@article{' $<id>=<-[,]>+ ','

The article regex in turn contains $<id> on the left side of an assignment. This corresponds to assigning a Match object to a new 'id' key added to the article level Match object.

Hopefully this is enough (perhaps too much!) and I can now explain the last line of code, which, once again, was:

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

The for iterates over the list of 3 Match objects (corresponding to the 3 articles in the input) that were generated during the parse and stored under the 'article' key of the TOP level Match object.

(This iteration automatically assigns each of these three sub Match objects to $_, aka "it" or "the topic", and then, after each assignment, does the code in the block ({ ... }). The code in the block will typically refer, either explicitly or implicitly, to $_.)

The .<id> bit in the block is equivalent to $_.<id>, i.e. it implicitly refers to $_. As just explained, $_ is the article level Match object being processed this time around the for loop. The <id> bit means .<id> returns the Match object stored under the 'id' key of the article level Match object.

Finally, the .<kv-pairs><kv-pair>[0]<value> bit refers to the Match object stored under the 'value' key of the Match object stored as the first (0th) element of the list of Match objects stored under the kv-pair key of the Match object corresponding to the kv-pairs rule which in turn is stored under the 'kv-pairs' key of an article level Match object.
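The same kind of navigation can be practised on a much smaller tree. The Fields grammar below is a made-up toy (its names are not from the bib grammar), so treat it as a sketch of the subscripting only:

```raku
# Hypothetical toy grammar: comma-separated key=number fields.
grammar Fields {
    token TOP   { <field>+ % ',' }
    token field { $<key>=\w+ '=' $<value>=\d+ }
}

my $tree = Fields.parse('volume=113,number=10');

say $tree<field>.elems;           # <field>+ stored a list of Match objects
say $tree<field>[1]<value>.Str;   # second field, then its 'value' capture

# Inside the block, .<key> is short for $_.<key>.
for $tree<field>.list { say .<key> ~ ': ' ~ .<value> }
```

Each subscript steps one level down the tree: a key picks a named capture, an index picks one element of a quantified capture.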

Phew!

When the automatically generated parse tree isn't what you want

As if all the above were not enough, I need to mention one more thing.

The parse tree strongly reflects the implicit tree structure of the grammar. But getting this structure as a result of a parse is sometimes inconvenient -- one may want a different tree structure instead, perhaps a much simpler tree, perhaps some non-tree data structure.

The primary mechanism for generating exactly what you want from a parse when the automatic results aren't suitable is use of make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)

In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree.

Finally, the primary use case for these sparse trees is storing an AST.
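As a rough sketch of the make approach, here is a simplified grammar with an Action class that turns a parse straight into a hash of key => title. The KeyTitle grammar, the KeyTitleActions class and the sample input are all hypothetical, and the grammar assumes title is the first and only field of each entry, so it is not a drop-in replacement for the bib grammar above.

```raku
# Simplified, hypothetical grammar: it assumes each entry holds just a title field.
grammar KeyTitle {
    token TOP     { [ <article> \s* ]* }
    token article { '@article{' <id> ',' \s* 'title={' <title> '}' \s* '}' }
    token id      { <-[,]>+ }
    token title   { <-[}]>+ }
}

class KeyTitleActions {
    # Attach a key => title Pair to each article node.
    method article($/) { make Pair.new(~$<id>, ~$<title>) }

    # Collect the Pairs made by the article nodes into one Hash.
    method TOP($/) {
        my %by-key = $<article>.map(*.made);
        make %by-key;
    }
}

my $bib = q:to/END/;
    @article{garg2017patch,
      title={Patch testing}
    }
    @article{hauso2008neuroendocrine,
      title={Neuroendocrine tumor epidemiology}
    }
    END

my %titles = KeyTitle.parse($bib, actions => KeyTitleActions).made;

for %titles.sort -> $pair { say $pair.key ~ ': ' ~ $pair.value }
```

The calling code never walks the parse tree here; it only reads what the action methods attached with make.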
