Extracting from .bib file with Perl 6


Question

I have this .bib file for reference management while writing my thesis in LaTeX:

@article{garg2017patch,
  title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
  author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
  journal={Journal of Cosmetic Dermatology},
  year={2017},
  publisher={Wiley Online Library}
}

@article{hauso2008neuroendocrine,
  title={Neuroendocrine tumor epidemiology},
  author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
  journal={Cancer},
  volume={113},
  number={10},
  pages={2655--2664},
  year={2008},
  publisher={Wiley Online Library}
}

@article{siperstein1997laparoscopic,
  title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
  author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
  journal={Surgery},
  volume={122},
  number={6},
  pages={1147--1155},
  year={1997},
  publisher={Elsevier}
}

If anyone wants to know what a .bib file is, you can find it detailed here.

I'd like to parse this with Perl 6 to extract the key along with the title like this:

garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study

hauso2008neuroendocrine: Neuroendocrine tumor epidemiology

siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases

Can you please help me to do this, maybe in two ways:

  1. Using basic Perl 6
  2. Using a Perl 6 Grammar

Solution

This answer is aimed at being both:

  • An introductory general answer to "I want to parse X with Perl 6. Can anyone help?"

  • A complete and detailed answer that does exactly as @Suman asks.


In a single statement (power user)

"$_[0]: $_[1]\n" .put
  for (slurp 'derm.bib')
    ~~ m:g/ '@article{' (<-[,]>+) ',' \s+ 'title={' ~ '}' (<-[}]>+) /

(Run this code at glot.io.)

I decided to start with the sort of thing a dev familiar with P6 would write in a few minutes to do just the simple task you've specified in your question if they didn't much care about readability for newbies.

I'm not going to provide an explanation of it. It just does the job. If you're a P6 newbie it could well be overwhelming. If so, please read the rest of my answer -- it takes things slower and has comprehensive commentary. Perhaps return here and see if it makes more sense after reading the rest.

A "basic Perl 6" solution

my \input      = slurp 'derm.bib' ;

my \pattern    = rule { '@article{'       ( <-[,]>+ ) ','
                          'title={' ~ '}' ( <-[}]>+ ) }

my \articles   = input.match: pattern, :global ;

for articles -> $/ { print "$0: $1\n\n" }

This is almost identical to the "single statement (power user)" code -- broken into four statements rather than one. I could have made it copy the first version of the code more closely but have instead made a few changes that I'll explain. I've done this to make it clearer that P6 deliberately has its features be a scalable and refactorable continuum so one can mix and, er, match whatever features best fit a given use case.

my \input      = slurp 'derm.bib' ;

Perls are famous for their sigils. In P6, if you don't need them you can "slash" them out: declaring a variable with a leading backslash, as in my \input, gives it a sigilless name. Perls are also famous for having terse ways of doing things. slurp reads a file in its entirety in one go.

my \pattern    = rule { '@article{'       ( <-[,]>+ ) ','
                          'title={' ~ '}' ( <-[}]>+ ) }

Perl 6 patterns are generically called regexes or Rules. There are several types of regexes/rules. The pattern language is the same; the distinct types just direct the matching engine to modify how it handles a given pattern.

One regex/rule type is the P6 equivalent of classic regexes. These are declared with either /.../ or regex {...}. The regex in the opening "power user" code was one of these regexes. Their distinction is that they backtrack when necessary, just like classic regexes.

There's no need for backtracking to match the .bib format. Unless you need backtracking, it's wise to consider using one of the other rule types instead. I've switched to a rule declared with the keyword rule.

A rule declared with rule is identical to one declared with regex (or /.../) except that A) it doesn't backtrack and B) it interprets spaces in its pattern as corresponding to possible spaces in the input. Did you spot that I'd dropped the \s+ from the pattern immediately before 'title={'? That's because a rule takes care of that automatically.
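The whitespace difference can be seen side by side in a small sketch. The sample entry and the variable names below are made up for illustration; only the two patterns follow the ones discussed above.

```perl6
# A made-up entry, quoted with a non-interpolating heredoc.
my $input = q:to/END/;
@article{demo2020key,
  title={A demo title},
  year={2020}
}
END

# regex: whitespace in the pattern is ignored, so \s+ has to be spelled out.
my $as-regex = regex { '@article{' (<-[,]>+) ',' \s+ 'title={' (<-[}]>+) '}' };

# rule: whitespace after an atom in the pattern matches optional whitespace
# in the input, so the explicit \s+ can be dropped.
my $as-rule  = rule  { '@article{' (<-[,]>+) ',' 'title={' (<-[}]>+) '}' };

say $input.match($as-regex)[1].Str;   # A demo title
say $input.match($as-rule)[1].Str;    # A demo title
```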

The other difference is that I wrote:

'title={' ~ '}' ( ... )

instead of:

'title={' ( ... ) '}'

i.e. moving the pattern matching the bit between the braces after the braces and putting a ~ in between the braces instead. They match the same overall pattern. I could have written things either way in the power user /.../ pattern and either way in this section's rule pattern. But I wanted this section to be a bit more "best practice" oriented. I'll defer a full explanation of this difference and all the other details of this pattern until the Explanation of 'bib' grammar section below.

my \articles   = input.match: pattern, :global ;

This line uses the method form of the m routine used in the earlier "power user" version.

:global is the same as :g. I could have written it either way in both versions.

Add :global (or :g) to the argument list when invoking the .match method (or m routine) if you want to search the entire string being matched, finding as many matches as there are, not just the first. The method (or m routine) then returns a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
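A minimal sketch of the difference, using a throwaway string rather than the .bib file:

```perl6
my $text = 'a1 b2 c3';

# Without :global, .match returns a single Match object: the first match.
my $first = $text.match(/ \w \d /);
say $first.Str;                   # a1

# With :global (or :g), .match returns a list of Match objects.
my @all = $text.match(/ \w \d /, :global);
say @all.elems;                   # 3
say @all.map(*.Str).join(',');    # a1,b2,c3
```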

for articles -> $/ { print "$0: $1\n\n" }

Per the P6 doc on $/, "$/ is the match variable ... so usually contains objects of type Match." It also provides some other conveniences, and we take advantage of one of them here, as explained next.

The for loop successively binds each of the overall Match objects (corresponding to each of the articles in your sample file that were successfully matched by the pattern) to the symbol $/ inside the for's block.

The pattern contains two pairs of parentheses. These generate "Positional captures". The overall Match object provides access to its two Positional captures via Positional subscripting (postfix []). Thus, within the for block, $/[0] and $/[1] provide access to the two Positional captures for a given article. But so do $0 and $1 -- because standard P6 aliases these latter symbols to $/[0] and $/[1] for your convenience.
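To see the aliasing on its own, here is a small sketch with a made-up one-line entry (the string and the variable name are illustrative only):

```perl6
my $line = '@article{demo2020key, title={A demo title}';

# A successful smartmatch sets $/ for the code that follows.
if $line ~~ / '@article{' (<-[,]>+) ',' \s* 'title={' (<-[}]>+) '}' / {
    say $/[0].Str;              # demo2020key
    say $0.Str;                 # demo2020key  ($0 is an alias for $/[0])
    say $/[1].Str;              # A demo title
    say $1.Str;                 # A demo title ($1 is an alias for $/[1])
}
```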


Still with me?

The latter half of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above.

But first...

A "boring" practical answer

I want to parse this with Perl 6. Can anyone help?

P6 may make writing parsers less tedious than with other tools. But less tedious is still tedious. And P6 parsing is currently slow.

In most cases, the practical answer when you want to parse anything beyond the most trivial of file formats -- especially a well known format that's several decades old -- is to find and use an existing parser.

You might start with a search for 'bib' on modules.perl6.org in the hope of finding a publicly shared 'bib' parsing module. Either a pure Perl 6 one or some P6 wrapper around a non-P6 library. But at the time of writing this there are no matches for 'bib'.

There's almost certainly a 'bib' parsing C library already available. And it's likely to be the fastest solution. It's also likely that you can easily and elegantly use an external parsing library packaged as a C lib, in your own P6 code, even if you don't know C. If the NativeCall documentation is either too much or too little explanation, consider visiting the freenode IRC channel #perl6 and asking for whatever NativeCall help you need or want.

If a C lib isn't right for a particular use case then you can probably still use packages written in Perl 5, Python, Ruby, Lua, etc. via their Inline::* language adapters. Just install the Perl 5, Python or whatever package that you want; make sure it runs using that other language; install the appropriate language adapter; then use the package and its features as if it were a P6 package containing exported P6 functions, classes, objects, values, etc.

The Perl 5 adapter is the most mature so I'll use that as an example. Let's say you use Perl 5's Text::BibTeX packages and now wish to use Perl 6 with the existing Text::BibTeX::BibFormat module from Perl 5. First, set up the Perl 5 packages as they are supposed to be per their READMEs etc. Then, in Perl 6, write something like:

use Text::BibTeX::BibFormat:from<Perl5>;
...
@blocks = $entry.format;

The first line is how you tell P6 that you wish to load a P5 module. (It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Rakudo Perl 6 bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)

The last line is just a mechanical P6 translation of the @blocks = $entry->format; line from the SYNOPSIS of the Perl 5 Text::BibTeX::BibFormat.

Creating a P6 grammar / parser

OK. Enough "boring" practical advice. Let's now try to have some fun creating a P6 parser good enough for the example from your question.

# use Grammar::Tracer;

grammar bib {

    rule TOP           { <article>* }

    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }

    rule kv-pairs      { <kv-pair>* % ',' }

    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

}

With this grammar in place, we can now write something like:

die "Maybe use Grammar::Tracer?" unless bib.parsefile: 'derm.bib';

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

to generate exactly the same output as with the earlier "power user" and "basic Perl 6" solutions -- but using a grammar / parser approach.

Explanation of 'bib' grammar

# use Grammar::Tracer;

If a parse fails, the return value is Nil. P6 won't tell you how far it got. You'll have zero clue why your parse failed.

If you don't have a better option (?), then, when your grammar fails, use Grammar::Tracer to help debug (installing it first if you don't already have it installed).
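The Nil-on-failure behaviour is easy to check with a deliberately tiny grammar. Digits below is invented for this illustration and has nothing to do with .bib files:

```perl6
# use Grammar::Tracer;   # uncomment (after installing it) to trace rule calls

# A throwaway grammar: the whole input must be one or more digits.
grammar Digits {
    token TOP { \d+ }
}

# On success, .parse returns a Match object.
say Digits.parse('2017').defined;    # True

# On failure it returns Nil, with no hint about where matching stopped.
say Digits.parse('2017x').defined;   # False
```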

grammar bib {

The grammar keyword is like class, but a grammar can contain not just named methods as usual but also named regexes, tokens, and rules.

    rule TOP           {

Unless you specify otherwise, parsing routines start out by calling the rule (or token, regex, or method) named TOP.

As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't backtrack so they eliminate the risk of unnecessarily running slowly due to backtracking.)

But in this case I've used a rule. Like token patterns, rules also avoid backtracking. But in addition they take whitespace following any atom in the pattern to be significant in a natural manner. This is typically appropriate towards the top of the parse tree. (Tokens, and the occasional regex, are typically appropriate towards the leaves.)
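The backtracking difference between regex and token can be shown with a contrived pattern (the string 'aaab' and the pattern are made up for this sketch):

```perl6
my $text = 'aaab';

# regex: a+ first grabs 'aaa', then gives one 'a' back so that 'ab' can match.
say so $text ~~ regex { a+ ab };   # True

# token: a+ keeps everything it grabbed, so 'ab' can never match.
say so $text ~~ token { a+ ab };   # False
```

The .bib patterns in this answer never need to give characters back like this, which is why the non-backtracking declarators are a reasonable fit.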

    rule TOP           { <article>* }

The space at the end of the rule means the grammar will match any amount of whitespace at the end of the input.

<article> invokes another named rule (or token/regex/method) in this grammar.

Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.

    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }

I sometimes lay rules out to resemble the way typical input looks. I tried to do so here.

<[...]> is the P6 syntax for a character class, like [...] in traditional regex syntax. It's more powerful but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in ye olde [^,] syntax. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
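A quick sketch of a positive and a negated character class, applied to the first line of one of the sample entries:

```perl6
my $line = '@article{garg2017patch,';

# <-[,]>+ : one or more characters, none of which is a comma.
say ($line ~~ / '{' (<-[,]>+) /)[0].Str;   # garg2017patch

# <[a..z]>+ : one or more lowercase ASCII letters (first such run in the string).
say ($line ~~ / <[a..z]>+ /).Str;          # article
```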

$<id>=<-[,]>+ tells P6 to attempt to match the quantified atom on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key 'id' within the current Match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.

    rule kv-pairs      { <kv-pair>* % ',' }

This regex code illustrates one of several convenient P6 regex features. It says you want to match zero or more kv-pairs separated by commas.

(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
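Here is a small sketch of % on a throwaway string. It also shows the related %% operator, which additionally tolerates a trailing separator; that variant is not used in the grammar above and is included only for contrast.

```perl6
# Items separated by commas; each (\w) capture lands in the list $m[0].
my $m = 'a,b,c' ~~ / ^ (\w)+ % ',' $ /;
say $m[0].map(*.Str).join('|');             # a|b|c

# % requires the separator to sit *between* items, so a trailing comma fails.
say so 'a,b,' ~~ / ^ (\w)+ % ',' $ /;       # False

# %% also accepts a trailing separator.
say so 'a,b,' ~~ / ^ (\w)+ %% ',' $ /;      # True
```

In the sample file the last field of each entry has no trailing comma, so % is enough there.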

    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be much clearer.
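For input that matches, the two ways of writing a delimited structure give the same result. A minimal sketch (the variable names are illustrative):

```perl6
my $field = 'title={Neuroendocrine tumor epidemiology}';

# Conventional order: opener, contents, closer.
my $plain = $field.match: / 'title={' (<-[}]>+) '}' /;

# Tilde order: opener ~ closer, followed by the contents.
my $tilde = $field.match: / 'title={' ~ '}' (<-[}]>+) /;

say $plain[0].Str;               # Neuroendocrine tumor epidemiology
say $tilde[0].Str;               # Neuroendocrine tumor epidemiology
say $plain.Str eq $tilde.Str;    # True
```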

An explanation of the parse tree's construction/deconstruction

The $<article> and .<id> etc. bits in the last line (for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }) refer to Match objects that are stored in the parse tree that's generated and returned from a successful parse.

Returning to the top of the grammar:

    rule TOP           {

If a parse is successful, a single 'TOP' level Match object, the one corresponding to the top of the parse tree, is returned. (It's also made available to code immediately following the parse method call via the variable $/.)

But before that final return from parsing happens, many other Match objects, representing sub parts of the overall parse, will have been generated and added to the parse tree. Addition of Match objects to a parse tree is done by assigning either a single generated Match object, or a list of them, to either a Positional or Associative element of a "parent" Match object, as explained next.

    rule TOP           { <article>* }

A rule invocation like <article> has two effects. First, P6 tries to match the rule. Second, if it matches, P6 generates a corresponding Match object and adds it to the parse tree.

If the successfully matched pattern had been just <article>, rather than <article>*, then only one match would have been attempted and only one value, a single Match object, would have been generated and added to the parse tree.

But the pattern was <article>*, not merely <article>. So P6 attempts to match the article rule multiple times. If it matches at least once then it generates and stores a corresponding list of one or more Match objects. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)

So a list of Match objects is assigned to the 'article' key of the TOP level Match object. (If the matching regex had been just <article> rather than <article>* then a match would result in just a single Match object being assigned to the 'article' key rather than a list of them.)
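The single-versus-list distinction can be checked with two throwaway grammars. One and Many are invented names for this sketch and are unrelated to the bib grammar:

```perl6
# Unquantified subrule call: the 'word' key holds a single Match object.
grammar One {
    token TOP  { <word> }
    token word { \w+ }
}

# Quantified subrule call: the 'word' key holds a list of Match objects.
grammar Many {
    token TOP  { <word>* % ' ' }
    token word { \w+ }
}

say One.parse('alpha')<word>.Str;                        # alpha

my $many = Many.parse('alpha beta');
say $many<word>.elems;                                   # 2
say $many<word>.map(*.Str).join(',');                    # alpha,beta
```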

So now I'll try to explain the $<article> part of the last line of code, which was:

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

$<article> is short for $/.<article>.

Per the P6 doc on $/, "$/ is the match variable. It stores the result of the last Regex match and so usually contains objects of type Match."

The last Regex match in our case was the TOP rule from the bib grammar.

So $<article> is the value under the 'article' key of the TOP level Match object returned by the parse. This value is a list of 3 'article' level Match objects.

    rule article       { '@article{' $<id>=<-[,]>+ ','

The article regex in turn contains $<id> on the left side of an assignment. This corresponds to assigning a Match object to a new 'id' key added to the article level Match object.

Hopefully this is enough (perhaps too much!) and I can now explain the last line of code, which, once again, was:

for $<article> { say .<id> ~ ': ' ~ .<kv-pairs><kv-pair>[0]<value> ~ "\n" }

The for iterates over the list of 3 Match objects (corresponding to the 3 articles in the input) that were generated during the parse and stored under the 'article' key of the TOP level Match object.

(This iteration automatically assigns each of these three sub Match objects to $_, aka "it" or "the topic", and then, after each assignment, does the code in the block ({ ... }). The code in the block will typically refer, either explicitly or implicitly, to $_.)

The .<id> bit in the block is equivalent to $_.<id>, i.e. it implicitly refers to $_. As just explained, $_ is the article level Match object being processed this time around the for loop. The <id> bit means .<id> returns the Match object stored under the 'id' key of the article level Match object.

Finally, the .<kv-pairs><kv-pair>[0]<value> bit refers to the Match object stored under the 'value' key of the Match object stored as the first (0th) element of the list of Match objects stored under the kv-pair key of the Match object corresponding to the kv-pairs rule which in turn is stored under the 'kv-pairs' key of an article level Match object.

Phew!

When the automatically generated parse tree isn't what you want

As if all the above were not enough, I need to mention one more thing.

The parse tree strongly reflects the implicit tree structure of the grammar. But getting this structure as a result of a parse is sometimes inconvenient -- one may want a different tree structure instead, perhaps a much simpler tree, perhaps some non-tree data structure.

The primary mechanism for generating exactly what you want from a parse when the automatic results aren't suitable is use of make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)

In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree.

Finally, the primary use case for these sparse trees is storing an AST.
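As a rough sketch of how make can be used here, the code below reuses the bib grammar from above unchanged and pairs it with an Action class. The class name bib-actions, the made-up two-entry input, and the choice of building a list of id => fields pairs are all assumptions for illustration, not part of the original solution.

```perl6
grammar bib {
    rule TOP           { <article>* }
    rule article       { '@article{' $<id>=<-[,]>+ ','
                            <kv-pairs>
                         '}'
    }
    rule kv-pairs      { <kv-pair>* % ',' }
    rule kv-pair       { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
}

# Hypothetical action class: each method runs after the rule of the same
# name matches, and attaches a plain data structure to that Match via make.
class bib-actions {
    method kv-pair($/)  { make (~$<key> => ~$<value>) }
    method kv-pairs($/) { make $<kv-pair>.map(*.made).Hash }
    method article($/)  { make (~$<id> => $<kv-pairs>.made) }
    method TOP($/)      { make $<article>.map(*.made).List }
}

# Made-up sample input in the same shape as the question's file.
my $input = q:to/END/;
@article{demo2020key,
  title={A demo title},
  year={2020}
}

@article{other1999key,
  journal={Demo Journal},
  title={Another demo title},
  year={1999}
}
END

# .made returns whatever TOP's action passed to make.
my @entries = bib.parse($input, actions => bib-actions).made.list;

for @entries -> $entry {
    say "{$entry.key}: {$entry.value<title>}";
}
```

Because each entry's fields end up in a hash, the title is looked up by name, so this version does not depend on title being the first field, unlike the [0] index used earlier.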
