How do I find the length of a Unicode string in Perl?

Question

The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.

use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';

print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";

The output of this script, however, disagrees with the manpage:

ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35

It seems to me that length() and bytes::length() return the same value for both the ASCII and the Unicode string. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode. Does that mean length() automatically handles Unicode strings properly?

Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example: it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.

Recommended answer

If your script is saved as UTF-8, use the utf8 pragma so that Perl decodes the string literals in it into characters. The bytes pragma, on the other hand, forces byte semantics on length, even if the string is a decoded UTF-8 string. Both pragmas apply only within the current lexical scope.

$ascii = 'Lorem ipsum dolor sit amet';
{
    use utf8;
    $unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';

no bytes; # default, can be omitted
print "Character semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

print "----\n";

use bytes;
print "Byte semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

This outputs:

Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
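
For the Content-Length case mentioned in the question, one common alternative to the bytes pragma is to keep the text as a character string and encode it explicitly just before sending. The following is a minimal sketch, not part of the original answer. It assumes the source file is saved as UTF-8 and the HTTP body is sent as UTF-8; the variable names $body and $octets are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                 # assumption: this source file is saved as UTF-8
use Encode qw(encode);    # core module

# A decoded character string (same sample text as above).
my $body = 'Lørëm ípsüm dölör sît åmét';

# length() on a decoded string counts characters.
my $chars = length $body;

# Encode to the bytes that would actually go on the wire,
# then count those bytes for the Content-Length header.
my $octets = encode('UTF-8', $body);
my $bytes  = length $octets;

print "Characters: $chars\n";
print "Content-Length: $bytes\n";
```

With the sample string stored in precomposed form, as in the output above, this should report 26 characters and a Content-Length of 35. The character count and the byte count come from two different strings, so neither depends on which pragma happens to be in scope.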
