How do I find the length of a Unicode string in Perl?
Question
The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.
use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';
print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";
The output of this script, however, disagrees with the manpage:
ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35
It seems to me that length() and bytes::length() return the same value for both ASCII and Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode. Does that mean length() automatically handles Unicode strings properly?
Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example: it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
Recommended answer
If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma, on the other hand, forces byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.
$ascii = 'Lorem ipsum dolor sit amet';
{
use utf8;
$unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';
no bytes; # default, can be omitted
print "Character semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
print "----\n";
use bytes;
print "Byte semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
This outputs:
Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
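For the Content-Length case mentioned in the question, one alternative to the bytes pragma is to encode the string explicitly and measure the result. The following is a minimal sketch, not part of the original answer: it assumes the script file is saved as UTF-8 and that the body is sent as UTF-8, and the variable names ($body, $octets, $chars, $bytes) are made up for illustration. It uses encode from the core Encode module.

```perl
use strict;
use warnings;
use utf8;                  # assumption: this source file is saved as UTF-8
use Encode qw(encode);

# With "use utf8" in effect, this literal is a string of 26 characters.
my $body = 'Lørëm ípsüm dölör sît åmét';

# length() on the decoded string counts characters.
my $chars = length($body);

# Encoding yields the octets that would actually be sent;
# length() on that byte string counts bytes.
my $octets = encode('UTF-8', $body);
my $bytes  = length($octets);

print "Characters: $chars\n";
print "Content-Length: $bytes\n";
```

Under those assumptions this prints 26 characters and a Content-Length of 35, matching the two counts shown in the answer's output, without switching the whole lexical scope to byte semantics.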