Raku grammar tokens don't match the first occurrences in a document but match similar later occurrences

Question

I want to process the whole Tanach file, in Hebrew. For that, I chose Raku because of some of its features (grammars and Unicode support).

So, I defined some tokens to select the relevant data.

grammar HEB {
        token TOP {'<hebrewname>'<t_word>'</hebrewname>'}
        token t_word {<graph>+}
};

grammar CHA {
        token TOP {'<c n="'<t_number>'">'}
        token t_number {\d+}
};

grammar VER {
        token TOP {'<v n="'<t_number>'">'}
        token t_number {\d+}
};

grammar WOR {
        token TOP {'<w>'<t_word>'</w>'}
        token t_word {<graph>+}
};

Here is a very small part of the document (the Tanach in XML format), which is sufficient to show the problem:

<names>
<name>Genesis</name>
<abbrev>Gen</abbrev>
<number>1</number>
<filename>Genesis</filename>
<hebrewname>בראשית</hebrewname>
</names>
<c n="1">
<v n="1">
<w>בְּ/רֵאשִׁ֖ית</w>
<w>בָּרָ֣א</w>
<w>אֱלֹהִ֑ים</w>
<w>אֵ֥ת</w>
<w>הַ/שָּׁמַ֖יִם</w>
<w>וְ/אֵ֥ת</w>
<w>הָ/אָֽרֶץ׃</w>
</v>
<v n="2">
<w>וְ/הָ/אָ֗רֶץ</w>
<w>הָיְתָ֥ה</w>
<w>תֹ֙הוּ֙</w>
<w>וָ/בֹ֔הוּ</w>
<w>וְ/חֹ֖שֶׁךְ</w>
<w>עַל־</w>
<w>פְּנֵ֣י</w>
<w>תְה֑וֹם</w>
<w>וְ/ר֣וּחַ</w>
<w>אֱלֹהִ֔ים</w>
<w>מְרַחֶ֖פֶת</w>
<w>עַל־</w>
<w>פְּנֵ֥י</w>
<w>הַ/מָּֽיִם׃</w>
</v>

The problem is that the code doesn't recognize the first two words (<w>בְּ/רֵאשִׁ֖ית</w> <w>בָּרָ֣א</w>) but seems to work fine with the following words. Could somebody explain to me what's wrong?

The main loop is:

for $file_in.lines -> $line {
    $memline = $line.trim;

    if HEB.parse($memline) {
          say "hebrew name of book is "~ $/<t_word>;
          next;
    }
    if CHA.parse($memline) {
        say "chapitre number is "~ $/<t_number>;
        next;
    }
    if VER.parse($memline) {
        say "verse number is "~ $/<t_number>;
        next;
    }
    if WOR.parse($memline) {
        $computed_word_value = 0;
        say "word is "~ $/<t_word>;
        $file_out.print("$/<t_word>");
        say "numbers of graphemes of word is "~ $/<t_word>.chars;
        @exploded_word = $/<t_word>.comb;
        for @exploded_word {
                say $_.uniname;
        };
        next;
    }
    say "not processed";
}

Output:

Please note that after "verse number is 1", the first two words are not processed. Ignore the distorted Hebrew (Windows console)!

not processed
not processed
not processed
not processed
not processed
hebrew name of book is ׳‘׳¨׳׳©׳™׳×
not processed
chapitre number is 1
verse number is 1
not processed
not processed
word is ׳ײ±׳œײ¹׳"ײ´ײ‘׳™׳
numbers of graphemes of word is 5
HEBREW LETTER ALEF
HEBREW LETTER LAMED
HEBREW LETTER HE
HEBREW LETTER YOD
HEBREW LETTER FINAL MEM
word is ׳ײµײ¥׳×
numbers of graphemes of word is 2
HEBREW LETTER ALEF
HEBREW LETTER TAV
not processed
word is ׳•ײ°/׳ײµײ¥׳×
numbers of graphemes of word is 4
HEBREW LETTER VAV
SOLIDUS

I hope my question is clearly stated.

Recommended answer

I can't reproduce your problem.
About the only thing I can guess is that you didn't open the file with the correct encoding.

Or worse, you are getting the file from standard input and don't have the proper codepage selected. (Which makes sense, since your output is also mojibake.)
Rakudo doesn't really do codepages, so if you don't set your environment to utf8 you have to change the encoding of $*IN (and $*OUT) to match whatever it is.
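
A minimal sketch of making the encoding explicit rather than relying on the environment. The temporary file name and the sample line are assumptions invented for illustration; the :enc argument and the .encoding method are ordinary IO::Handle features.

```raku
# Hypothetical sample file, written here only so the sketch is self-contained.
my $path = $*TMPDIR.add("tanach-enc-demo-{$*PID}.xml");
$path.spurt("<hebrewname>בראשית</hebrewname>\n", :enc<utf8>);

# Open the input explicitly as UTF-8 instead of trusting the code page.
my $file_in = $path.open(:r, :enc<utf8>);
my @lines = $file_in.lines;
$file_in.close;

# Make the standard output handle emit UTF-8 as well.
$*OUT.encoding('utf8');

say @lines[0];
$path.unlink;
```

If the console itself is not set to UTF-8, the printed text can still look garbled even though the data read from the file is correct.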

I'm now going to pretend that you posted to CodeReview.StackExchange.com instead.

First, I don't know why you are creating a whole grammar for something so small, which could easily be done with simple regexes.

my token HEB {
  '<hebrewname>'
  $<t_word> = [<.graph>+]
  '</hebrewname>'
}
my token CHA {
 '<c n="' $<t_number> = [\d+] '">'
}
my token VER {
  '<v n="' $<t_number> = [\d+] '">'
}
my token WOR {
  '<w>' $<t_word> = [<.graph>+] '</w>'
}

Honestly, that is still more than you seem to need, as you only deal with one element per regex.

That's also ignoring that I really dislike the names you are giving the elements, like t_word and t_number. The prefix is pointless: they live inside $/, and Grammar doesn't have any similarly named methods, so there is no chance of them interfering with any other namespace. Give them descriptive names if you must give them names.

You can just restrict $/ to only stringifying to the part you care about with <(…)>. (It works here because you are only capturing one thing.)

<( means ignore everything before, and )> means ignore everything after.

my token HEB {
  '<hebrewname>'
  <( <.graph>+ )> # $/ will contain only what <.graph>+ matches
  '</hebrewname>'
}
my token CHA {
 '<c n="' <( \d+ )> '">'
}
my token VER {
  '<v n="' <( \d+ )> '">'
}
my token WOR {
  '<w>' <( <.graph>+ )> '</w>'
}
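
As a quick check of how <( and )> narrow the match, here is a small sketch on a made-up verse tag (the sample string is an assumption, not taken from the file):

```raku
my $line = '<v n="12">';

# Without the capture markers, the match covers the whole tag.
my $whole = $line ~~ / '<v n="' \d+ '">' /;
say ~$whole;    # <v n="12">

# With <( and )>, only the digits are kept in the match.
my $digits = $line ~~ / '<v n="' <( \d+ )> '">' /;
say ~$digits;   # 12
say +$digits;   # 12, this time as a number
```

The surrounding literals still have to match; they are only left out of what the match object stringifies to.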


You are parsing it as if it were just a line-oriented file.
That does make a certain amount of sense, as it is formatted as one, and it results in less memory usage.

Using named regexes for that, let alone whole grammars, is a bit overkill. It also separates the logic when that isn't really necessary for such simple matches.

Here is how I would parse that file in a line-oriented fashion:

my $in-names = False;
my %names;
my @chapters;
my @verses;
my @current-verse;

for $file_in.lines {
  when /'<names>' / { $in-names = True  }
  when /'</names>'/ { $in-names = False }

  # chapter
  when /'<c n="' <( \d+ )> '">'/ {
    @verses := @chapters[ +$/ - 1 ] //= [];
  }
  when /'</c>'/ {
    # finalize this chapter
    # for example print out statistics
    # (only needed if you don't want `default` to catch it)
  }

  # verse
  when /'<v n="' <( \d+ )> '">'/ {
    @current-verse := @verses[ +$/ - 1 ] //= [];
  }
  when /'</v>'/ {
    # finalize this verse
  }

  # word
  when /'<w>' <( <.graph>+ )> '</w>'/ {
    push @current-verse, ~$/;
  }

  # name tags
  # must be after more specific regexes
  when /'<' <tag=.ident> '>' $<value> = [<.ident>|\d+] {} "</$<tag>>"/ {
    if $in-names {
      %names{~$<tag>} = ~$<value>
    } else {
      note "not handling $<tag> => $<value> outside of <names>"
    }
  }

  default { note "unexpected text '$_'" }
}

Note that when makes it so that you don't have to call next.
And since we just use $_ instead of $line, we can use regexes directly as the conditions of those when statements.
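
A stripped-down sketch of that control flow, using three made-up input lines (assumptions for illustration) in place of the file:

```raku
my @lines = '<c n="3">', '<v n="7">', 'stray text';
my @seen;

for @lines {
    # Each line is in $_, so a bare regex is matched against it.
    # After a `when` block runs, the loop moves on to the next line.
    when / '<c n="' <( \d+ )> '">' / { @seen.push: "chapter $/" }
    when / '<v n="' <( \d+ )> '">' / { @seen.push: "verse $/" }
    # Reached only when none of the `when` blocks matched.
    default { @seen.push: "unhandled: $_" }
}

.say for @seen;
```

Each line ends up in exactly one branch, which is what the explicit next calls in the original loop were doing by hand.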

I'm not bothering to use ^ or $, so there is no need to either trim or use ^\s* and \s*$.
It does make it a bit more fragile, so you may want to change it if it becomes a problem.
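
If the unanchored form does become a problem, an anchored variant could look like the sketch below. The two sample lines are assumptions chosen to show one accepted and one rejected case.

```raku
# Anchored variant: optional surrounding whitespace, nothing else on the line.
my $chapter = / ^ \s* '<c n="' <( \d+ )> '">' \s* $ /;

my @results;
for '  <c n="1">  ', '<c n="1"> trailing junk' -> $line {
    my $m = $line ~~ $chapter;
    @results.push: $m ?? "chapter $m" !! "rejected";
}

.say for @results;
```

The first line is accepted despite the padding, while the second is rejected because of the extra text after the tag.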

If you really want to just do simple line processing like you're doing, I'm sure you can alter the above to suit your needs.

I wanted to make this more useful to people who come across this in the future. So I created a data structure from the file instead of following what you were doing.
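
To show what that data structure looks like once filled, here is a small sketch that uses the same binding idiom as the loop. The two placeholder words are assumptions standing in for real Hebrew words.

```raku
my @chapters;

# Same binding idiom as in the loop: create the slot on first use,
# then bind a name to it so later pushes land inside @chapters.
my @verses        := @chapters[0] //= [];   # chapter 1
my @current-verse := @verses[0]   //= [];   # verse 1
push @current-verse, 'word-a', 'word-b';    # placeholder words

# Chapter and verse numbers are 1-based in the file, 0-based in the arrays.
say @chapters[0][0].elems;
say @chapters[0][0].join(' ');
```

So the words of chapter c, verse v are reachable as @chapters[c - 1][v - 1].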

Really, I probably would only have reached for a grammar if I were going to .parse() the entire file in one go.

Here is what such a grammar could look like:

grammar Book {
  rule TOP {
    <names>
    <chapter> +
    # note that there needs to be a space between <chapter> and +
    # so that whitespace can be between <c…>…</c> elements
  }

  rule names {
    '<names>'  ~  '</names>'
    <name> +
  }

  token name {
    '<' <tag=.ident> '>'
    $<name> = [<.ident>|\d+]
    {}
    "</$<tag>>"
  }

  rule chapter {
    # note space before ]
    ['<c n="' <number> '">' ]  ~  '</c>'
    <verse> +
  }
  rule verse {
    ['<v n="' <number> '">' ]  ~  '</v>'
    <word> +
  }

  token number { \d+ }
  token word { '<w>' <( <.graph>+ )> '</w>' }
}
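
Two features the grammar leans on are the ~ operator for paired delimiters and the way a rule treats whitespace in the pattern. A minimal sketch of both follows; the names spaced and tight are hypothetical and exist only for this illustration.

```raku
# '[' ~ ']' (\d+)  reads as: '[', then the digits, then the closing ']'.
my $m = '[42]' ~~ / '[' ~ ']' (\d+) /;
say ~$m[0];                          # 42

# A rule turns whitespace in the pattern into an implicit <.ws>;
# a token ignores it.
my rule  spaced { 'a' 'b' }
my token tight  { 'a' 'b' }
say so 'a b' ~~ / ^ <spaced> $ /;    # True
say so 'a b' ~~ / ^ <tight>  $ /;    # False
say so 'ab'  ~~ / ^ <tight>  $ /;    # True
```

That is why TOP, names, chapter and verse are rules (the elements sit on separate lines), while name, number and word are tokens.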

To do processing similar to what you have been doing:

class Line-Actions {
  has IO::Handle:D $.file-out is required;
  has $!number-type is default<chapter>;

  method name ($/) {
    if $<tag> eq 'hebrewname' {
      say "hebrew name of book is $<name>";
    }
  }

  # note that .chapter and .verse will run at the end
  # of parsing them, which is too late for when .word is processed
  # so we do it in .number instead
  method number ($/) {
    say "$!number-type number is $/";
    $!number-type = 'verse';
  }
  method chapter ($/) {
    # reset to default of "chapter"
    # as the next .number will be for the next chapter
    $!number-type = Nil;
  }

  method word ($/) {
    say "word is $/";
    $!file-out.print(~$/);
    say "number of graphemes in word is $/.chars()";
    .say for "$/".comb.map: *.uninames.join(', ');
  }
}


Book.parsefile(
  $filename,
  actions => Line-Actions.new( file-out => 'outfile.txt'.IO.open(:w) )
);
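
The comment in Line-Actions about ordering can be checked with a much smaller sketch. The KeyVal grammar and Call-Log class are hypothetical names made up for this illustration: an action method runs when its rule has finished matching, so inner rules report before the rule that contains them.

```raku
grammar KeyVal {
    token TOP   { <key> '=' <value> }
    token key   { \w+ }
    token value { \d+ }
}

class Call-Log {
    has @.calls;
    # Each method is called when the rule of the same name has matched.
    method key   ($/) { @!calls.push: "key $/"   }
    method value ($/) { @!calls.push: "value $/" }
    method TOP   ($/) { @!calls.push: "TOP $/"   }
}

my $log = Call-Log.new;
KeyVal.parse('n=1', actions => $log);
.say for $log.calls;
```

TOP is recorded last, which mirrors why the chapter and verse methods would fire too late to label the words inside them.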
