Perl Unicode内部-与utf8混淆 [英] Perl Unicode internals - mess with utf8

查看:134
本文介绍了Perl Unicode内部-与utf8混淆的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在任何人告诉我RTFM之前,我必须说-我已经研究过了:

Before anyone will tells me to RTFM, I must say - I have digged through:

  • Why does modern Perl avoid UTF-8 by default?
  • Checklist for going the Unicode way with Perl
  • How to match string with diacritic in perl?
  • How to make "use My::defaults" with modern perl & utf8 defaults?
  • and many others (like perluniintro and others) - but - sure, missed something

因此,基本代码:

use 5.014;           #getting 'unicode_strings' feature
use uni::perl;       #turning on many utf8 things
use Unicode::Normalize  qw(NFD NFC);
use warnings;
while(<>) {
    chomp;
    my $data = NFD($_);
    say "OK" if utf8::is_utf8($data);
}

这时,从 utf8 编码的STDIN的中,我在$data中获得了正确的 unicode 字符串,例如"\ w"将匹配多字节[\p{Alphabetic}\p{Decimal_Number}\p{Letter_Number}](也许还有更多).可以,而且可以.

At this point, from the utf8 encoded STDIN I got a correct unicode string in $data, e.g. "\w" will match multibyte [\p{Alphabetic}\p{Decimal_Number}\p{Letter_Number}] (maybe something more). That's ok and works.

AFAIK $data 不包含utf8,而是perl's internal Unicode格式的字符串.

AFAIK $data does not contain utf8, but a string in perl's internal Unicode format.

现在是问题:

  • 我如何确保(测试)任何$other_data包含有效的Unicode字符串?
  • utf8 :: is_utf8($ data)有什么用途?整个 utf8 编译指示对我来说还是个谜.
  • HOW can I ensure (test it), that any $other_data contains valid Unicode string?
  • For what purpose is the utf8::is_utf8($data)? The whole utf8 pragma is a mystery for me.

我了解到use utf8;仅用于告诉Perl我的源代码是在utf8中(所以,从Perl的角度来看,我做的事情与我的脚本以BOM表标记开头-对于BigEndian一样).源代码就像一个外部文件-Perl应该知道它是哪种编码...

I understand that the use utf8; is only for the purpose of telling Perl that my source code is in utf8 (so do similar things as when my script starts with BOM flag - for BigEndian) - from Perl's point of view, my source code is like an external file - and Perl should know in what encoding it is...

在上面的示例中,utf8::is_utf8($data)将显示OK-但我不明白为什么.

In the above example utf8::is_utf8($data) will print OK - but I don't understand WHY.

Perl内部不使用utf8,因此我的utf8数据文件被转换为Perl的内部Unicode,所以为什么utf8::is_utf8($data)对于$data返回true,而不是utf8格式的 ?否则会被错误命名,该函数应命名为uni :: is_unicode($ data)???

Internally Perl does not use utf8, so my utf8 data-file is converted into Perl's internal Unicode, so why does the utf8::is_utf8($data) return true for $data, which is not in utf8 format? Or it is misnamed and the function should be named as uni::is_unicode($data)???

谢谢您的澄清.

Ps:@brian d foy-是的,我仍然没有有效的Perl编程书-我会明白的-我保证:)/joking/

Ps: @brian d foy - yes, I still don't have the Effective Perl Programming book - I will get it - I promise :) /joking/

推荐答案

is_utf8返回有关使用哪种内部存储格式(句点)的信息.

is_utf8 returns information about which internal storage format was used, period.

  • 它与字符串的值无关(尽管某些字符串只能以两种格式之一存储).
  • 与字符串是否已解码无关.
  • 与字符串是否包含使用UTF-8编码的内容无关.
  • 这不是任何形式的有效性检查.

现在继续您的问题.

整个utf8编译指示对我来说还是个谜.
The whole utf8 pragma is a mystery for me.

use utf8;告诉perl您的源代码是使用UTF-8编码的.如果您不这样说,perl实际上会假定它是iso-8859-1(作为内部机制的副作用).

use utf8; tells perl your source code is encoded using UTF-8. If you don't tell it so, perl effectively assumes it's iso-8859-1 (as a side-effect of internal mechanisms).

utf8 ::名称空间中的功能与实用程序无关,它们具有多种用途.

The functions in the utf8:: namespace are unrelated to the pragma, and they serve a variety of purposes.

  • utf8::encodeutf8::decode:有用的编码和解码功能.与Encode的encode_utf8decode_utf8类似,但是它们就地工作.
  • utf8::upgradeutf8::downgrade:很少使用,但对于解决XS模块中的错误很有用.详情请见下文.
  • utf8::is_utf8:我不知道为什么有人会使用它.
  • utf8::encode and utf8::decode: Useful encoding and decoding functions. Similar to Encode's encode_utf8 and decode_utf8, but they work in-place.
  • utf8::upgrade and utf8::downgrade: Rarely used, but useful for working around bugs in XS modules. More on this below.
  • utf8::is_utf8: I don't know why someone would ever use that.
我怎么能保证(测试)比任何$ other_data都包含有效的unicode字符串?
HOW i can ensure (test it), than any $other_data contains valid unicode string?

有效Unicode字符串"对您意味着什么? Unicode在不同情况下具有不同的有效定义.

What does "valid Unicode string" mean to you? Unicode has different definitions of valid for different circumstances.

utf8 :: is_utf8($ data)有什么用途?
for what purpose is the utf8::is_utf8($data)?

调试.它会窥视Perl胆量.

Debugging. It peeks at Perl guts.

在上面的示例中utf8 :: is_utf8($ data)将显示OK-但不理解为什么.
In the above example utf8::is_utf8($data) will print OK - but don't understand WHY.

因为NFD恰好选择返回包含UTF8 = 1格式字符串的标量.

Because NFD happens to have chosen to return a scalar containing a string in the UTF8=1 format.

Perl有两种​​存储字符串的格式:

Perl has two formats for storing strings:

  • UTF8 = 0可以存储8位值的序列.
  • UTF8 = 1可以存储72位值的序列(尽管实际上限制为32位或64位).

第一种格式使用较少的内存,并且在访问字符串中的特定位置时速度更快,但是它所包含的内容受到限制. (例如,它不能存储Unicode代码点,因为它们需要21位.)Perl可以在两者之间自由切换.

The first format uses less memory and is faster when it comes to access a specific position in the string, but it's limited in what it can contain. (For example, it can't store Unicode code points since they require 21 bits.) Perl can freely switch between the two.

use utf8;
use feature qw( say );

my $d = my $u = "abcdé";
utf8::downgrade($d);  # Switch to using the UTF8=0 format for $d.
utf8::upgrade($u);    # Switch to using the UTF8=1 format for $u.

say utf8::is_utf8($d) ?1:0;   # 0
say utf8::is_utf8($u) ?1:0;   # 1
say $d eq $u          ?1:0;   # 1

通常不必担心这一点,但是有错误的模块.尽管use feature qw( unicode_strings );,甚至还有Perl的错误角落.可以使用utf8::upgradeutf8::downgrade将标量的格式更改为XS函数期望的格式.

One normally doesn't have to worry about this, but there are buggy modules. There are even buggy corners of Perl remaining despite use feature qw( unicode_strings );. One can use utf8::upgrade and utf8::downgrade for changing the format of a scalar to that expected by the XS function.

或者它的名字不正确,该函数应命名为uni :: is_unicode($ data)???
Or it is miss-named and the function should be named as uni::is_unicode($data)???

那再好不过了. Perl无法知道字符串是否为Unicode字符串.如果您需要跟踪它,那么您需要自己对其进行跟踪.

That's no better. Perl has no way to know whether a string is a Unicode string or not. If you need to track that, you need to track it yourself.

UTF8 = 0格式的字符串可能包含Unicode代码点.

Strings in the UTF8=0 format may contain Unicode code points.

my $s = "abc";  # U+0041,0042,0043

UTF8 = 1格式的字符串可能包含不是Unicode代码点的值.

Strings in the UTF8=1 format may contain values that aren't Unicode code points.

my $s = pack('W*', @temperature_measurements);

这篇关于Perl Unicode内部-与utf8混淆的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆