How do I tokenise a word given tokens that are subsumed incompletely in the word?


Question

I understand how to use regex in Perl in the following way:

$str =~ s/expression/replacement/g;

I understand that if any part of the expression is enclosed in parentheses, it can be used and captured in the replacement part, like this:

$str =~ s/(a)/($1)dosomething/;

But is there a way to capture the ($1) above outside of the regex expression?

I have a full word which is a string of consonants, e.g. bEdmA, its vowelized version baEodamaA (where a and o are vowels), as well as its split-up form of two tokens separated by a space, bEd mA. I want to pick up just the vowelized form of each token from the full word, like so: baEoda, maA. I'm trying to capture the token within the full word expression, so I have:

$unvowelizedword = "bEdmA";
$tokens[0] = "bEd", $tokens[1] = "mA";
$vowelizedword = "baEodamaA";

foreach $t(@tokens) {
    #find the token within the full word, and capture its vowels
}

I'm trying to do something like this:

$vowelizedword = m/($t)/;

This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?

The real question is: given the tokens bEd and mA, how can I capture the vowelized sequences baEoda and maA from the full word baEodamaA?

I realized from all the answers that I left out two important details.

  1. Vowels are optional. So if the tokens are "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
  2. I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:

@vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

Recommended answer

The following seems to do what you want:

#!/usr/bin/env perl
use warnings;
use strict;

my @tokens = ('bEd', 'mA');
my $vowelizedword = "beEodamaA";

my @regex = map { join('.?', split //) . '.?' } @tokens;

my $regex = join('|', @regex);
$regex = qr/($regex)/;

while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    print "matched $matched\n";
}
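
On the narrower question of getting a capture into a variable outside the regex: the loop above relies on the fact that a match in list context returns its captured groups. A minimal sketch of that mechanism, using a made-up string ("banana") and made-up variable names that are not part of the original question:

```perl
use strict;
use warnings;

my $str = "banana";   # illustrative input only

# In list context a successful match returns the captured groups,
# so the first group can be assigned straight to a variable.
my ($first) = $str =~ /(a.)/;     # first "a" plus the character after it

# With /g in list context, the group from every match is returned.
my @all = $str =~ /(a.)/g;

# $1 is also available after a successful match; copy it right away,
# because the next successful match overwrites it.
my $saved;
if ($str =~ /(n.)/) {
    $saved = $1;
}

print "first=$first all=@all saved=$saved\n";
```

Checking that the match succeeded before reading `$1` matters, since a failed match leaves `$1` holding whatever the previous successful match put there.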

Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:

#!/usr/bin/env perl
use warnings;
use strict;

my @tokens = ('bEd', 'mA', 'Al', 'ywm');
my $vowelizedword = "beEodamaA Alyawmi"; # Caveat: Without the space it won't work.

my @regex = map { join('.?', split //) . '.?$' } @tokens;

my $regex = join('|', @regex);
$regex = qr/($regex)/;

while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    print "matched $matched\n";
}
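
Both versions above use `.?`, which accepts any single character between consonants, so a two-character mark such as '~a' would not fit in one slot. One possible alternative, offered as a sketch rather than as the answerer's method, is to build the pattern from the vowel list in the question and consume the tokens left to right with `\G`. The sub name `vowelized_tokens` is hypothetical. The sketch assumes the tokens are given in word order, that consonant letters never coincide with vowel symbols, and that the vowelized word is baEodamaA as first written in the question (the spelling beEodamaA contains e, which is not in the vowel list).

```perl
use strict;
use warnings;

# Vowel marks as listed in the question.
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

# Alternation of all marks, longest first so '~a' is tried before '~'.
my $vowel = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @vowels;
my $vowel_run = "(?:$vowel)*";    # zero or more marks: vowels are optional

# Hypothetical helper: returns the vowelized form of each token, in order,
# or an empty list if some token cannot be matched at the current position.
sub vowelized_tokens {
    my ($word, @tokens) = @_;
    my @out;
    pos($word) = 0;
    for my $token (@tokens) {
        # Consonants of the token with an optional vowel run between them.
        my $pattern = join $vowel_run, map { quotemeta($_) } split //, $token;
        # \G anchors at the end of the previous match; /gc advances pos()
        # on success and leaves it untouched on failure.
        $word =~ /\G($pattern$vowel_run)/gc or return;
        push @out, $1;
    }
    return @out;
}

print join(', ', vowelized_tokens('baEodamaA', 'bEd', 'mA')), "\n";
print join(', ', vowelized_tokens('Alyawmi',   'Al',  'ywm')), "\n";
```

Because each token is matched starting where the previous one ended, the results come back in word order and no separating space is needed between words that are processed one at a time.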
