How to remove diacritics in Perl 6

Problem description

Two related questions. Perl 6 is smart enough to treat a grapheme as one character, whether it is a single Unicode codepoint (like ä, U+00E4) or two or more combined codepoints (like p̄ and ḏ̣). This little piece of code

my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄" 
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;

gives the following output:

ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character

But sometimes I would like to be able to do the following. 1) Remove diacritics from ä. So I need some method like

"ä".mymethod → "a"

2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron (U+0304). E.g. something like the following in bash:

$ echo p̄ | grep . -o | wc -l
2

Solution

Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.

Per the documentation:

multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method    samemark(Str:D: Str:D $pattern --> Str:D)

Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.

Examples:

say 'åäö'.samemark('aäo');                        # OUTPUT: «aäo␤» 
say 'åäö'.samemark('a');                          # OUTPUT: «aao␤» 

say samemark('Pêrl', 'a');                        # OUTPUT: «Perl␤» 
say samemark('aöä', '');                          # OUTPUT: «aöä␤» 

This can be used both to remove marks/diacritics from letters, as well as to add them.
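
As a minimal sketch of the removal case, the call can be wrapped in a small helper. The name strip-marks is hypothetical (it is not part of the Str API), and the sketch assumes that an unaccented pattern such as 'a' is enough to clear the marks from every character, as the documentation examples above suggest:

```raku
# Hypothetical helper, not a built-in: the one-character pattern 'a' carries
# no marks, and samemark applies the last pattern character to the rest of
# the string, so every character ends up without marks.
sub strip-marks(Str:D $s --> Str:D) {
    $s.samemark('a')
}

say strip-marks('Pêrl');   # OUTPUT: «Perl␤»
say strip-marks('åäö');    # OUTPUT: «aao␤»
```

Only the marks are taken from the pattern, so the base letters of the input are left as they are.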

For (2), there are a few ways to do this (TIMTOWTDI). If you want all the codepoints in a string, you can use the ords method, which returns them in order (as a Seq in current Rakudo releases).

say "p̄".ords;                  # OUTPUT: «(112 772)␤»

You can use the uniname method/routine to get the Unicode name for a codepoint:

.uniname.say for "p̄".ords;     # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»

or just use the uninames method/routine:

.say for "p̄".uninames;         # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»

If you just want the number of codepoints in the string, you can use codes:

say "p̄".codes;                 # OUTPUT: «2␤»

This is different from chars, which counts the number of characters (graphemes) in the string:

say "p̄".chars;                 # OUTPUT: «1␤»

Also see @hobbs' answer using NFD.
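
One caveat for (2): ords and codes report a precomposed character such as ä (U+00E4) as a single codepoint, so splitting it into a base letter and a mark needs a decomposition step. The following is a minimal sketch of the NFD route, not necessarily the code from the referenced answer. The helper names decompose and strip-marks-nfd are hypothetical, and the filter assumes that "diacritic" means any codepoint whose General_Category starts with M (Mn, Mc or Me):

```raku
# Hypothetical helper: the codepoints of the canonically decomposed (NFD) form.
sub decompose(Str:D $s) {
    $s.NFD.list
}

# Hypothetical helper: decompose, drop every codepoint whose General_Category
# is a mark (assumption: that is what "diacritic" means here), then rebuild
# a Str from the remaining codepoints.
sub strip-marks-nfd(Str:D $s --> Str:D) {
    $s.NFD.list.grep({ !.uniprop.starts-with('M') }).map(*.chr).join
}

say "ä".codes;                        # OUTPUT: «1␤»
.uniname.say for decompose("ä");      # base letter first, then the combining mark
say strip-marks-nfd("ä");             # OUTPUT: «a␤»
```

Unlike samemark, this variant also exposes the individual parts, which is what the bash pipeline in the question was counting.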
