Unicode-ready wordsearch - Question


Question

Is this code OK? I don't really have a clue which normalization form I should use (the only thing I noticed is that with NFD I get wrong output).

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';

use Unicode::Normalize;
use Unicode::Collate::Locale;
use Unicode::GCString;

my $text = "my taxt täxt";
my %hash;

while ( $text =~ m/(\p{Alphabetic}+(?:'\p{Alphabetic}+)?)/g ) { #'
    my $word = $1;
    my $NFC_word = NFC( $word );
    $hash{$NFC_word}++;
}

my $collator = Unicode::Collate::Locale->new( locale => 'DE' ); 

for my $word ( $collator->sort( keys %hash ) ) {
    my $gcword = Unicode::GCString->new( $word );
    printf "%-10.10s : %5d\n", $gcword, $hash{$word};
}

Recommended answer

Wow!! I can’t believe nobody answered this. It’s a super duper great question. You almost had it right, too. I like that you’re using Unicode::Collate::Locale and Unicode::GCString. Good for you!

The reason you are getting "wrong" output is because you are not using the Unicode::GCString class's columns method to determine the print width of the stuff you’re printing.

printf is very stupid and just counts code points, not columns, so you have to write your own pad function that takes the GCS columns into account. For example, to do it manually, instead of writing this:

 printf "%-10.10s", $gcstring;

you have to write this:

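 my $colwidth = $gcstring->columns();
 if ($colwidth > 10) {
     # Too wide: keep the first 10 grapheme clusters
     # (substr counts clusters, not columns).
     print $gcstring->substr(0,10);
 } else {
     # Fits: print the string, then pad on the right up to
     # 10 columns, matching the left-justified %-10s.
     print $gcstring;
     print " " x (10 - $colwidth);
 }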

See how that works?
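
The manual steps above could be wrapped in a small helper. This is only a sketch: `pad_columns` is a hypothetical name, not something Unicode::GCString provides, and it assumes that an over-wide string should lose whole grapheme clusters from the end until it fits, so that a wide character is never cut in half.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::GCString;

binmode STDOUT, ':encoding(UTF-8)';

# pad_columns is a hypothetical helper, not part of Unicode::GCString.
# It left-justifies $text in a field of exactly $width display columns.
sub pad_columns {
    my ($text, $width) = @_;
    my $gcs = Unicode::GCString->new($text);

    # Drop trailing grapheme clusters until the string fits the field.
    while ($gcs->columns > $width) {
        $gcs = $gcs->substr(0, $gcs->length - 1);
    }

    # Pad on the right by columns, not by code points.
    return $gcs->as_string . (' ' x ($width - $gcs->columns));
}

# A decomposed (NFD) "täxt": five code points, four columns.
printf "%s : %5d\n", pad_columns("ta\x{308}xt", 10), 1;
```

In the question's loop, the `printf "%-10.10s : %5d\n", ...` line would then become `printf "%s : %5d\n", pad_columns($word, 10), $hash{$word};`, leaving only the numeric field to printf.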

Now normalization doesn’t matter. Ignore Kerrek’s old comment. It is very wrong. The UCA is specifically designed not to let normalization enter into the matter. You have to bend over backwards to screw that up, like by passing in normalization => undef to the constructor in case you want to use its gmatch method or some such.
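
To see this for the collator, here is a minimal sketch that compares the NFC and NFD forms of the same word. It uses plain Unicode::Collate with default settings rather than the question's Unicode::Collate::Locale object, and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate;

# The same word in composed (NFC) and decomposed (NFD) form.
my $nfc = NFC("t\x{E4}xt");    # "ä" as one code point
my $nfd = NFD("t\x{E4}xt");    # "a" followed by U+0308

# Default constructor: normalization is left switched on.
my $collator = Unicode::Collate->new;

# Plain string comparison sees two different code point sequences.
my $same_codepoints = ($nfc eq $nfd) ? 1 : 0;

# The collator treats both forms as the same string.
my $same_collation = $collator->eq($nfc, $nfd) ? 1 : 0;

print "code points equal: ", ($same_codepoints ? "yes" : "no"), "\n";
print "collation equal:   ", ($same_collation  ? "yes" : "no"), "\n";
```

With the default constructor the two forms compare as equal under the collator even though `eq` on the raw strings says they differ, so the sort order does not depend on which form the input is in.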
