Checklist for going the Unicode way with Perl
Question
I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.
Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only occur on rare occasions.
This is what I have done so far (forgive me for only including "summary" code examples):
Made sure files are always read and written in UTF-8:
use open ':utf8';
Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):
s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg; # Kept from existing code
s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg; # Added
utf8::decode $_;
Made sure text is printed as UTF-8:
binmode STDOUT, ':utf8';
Made sure browsers interpret my content as UTF-8:
Content-Type: text/html; charset=UTF-8
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
accept-charset="UTF-8"
I don't think I need the following, since inline text (menus, headings, etc.) is in ASCII only:
use utf8;
Does this look reasonable, or am I missing something?
I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
Recommended answer
The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirements of UTF-8 byte sequences, but for good security you want to reject anything that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, i.e. :encoding(UTF-8).
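As a rough illustration of the layer swap, the sketch below writes and reads a file through :encoding(UTF-8) using the 3-argument form of open. The temporary file and the sample string are assumptions made for the example, not part of the site's code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: a throwaway temp file stands in for one of the board's data files.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

my $text = "na\x{ef}ve \x{263a}";    # a character string, not bytes

# Strict layer on output: characters are encoded to UTF-8 bytes on write.
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Strict layer on input: bytes are decoded back to characters on read.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $round_trip = do { local $/; <$in> };
close $in or die "close: $!";
```

The same layer name should also work in the other places the checklist touches, for example use open qw(:std :encoding(UTF-8)); in place of use open ':utf8'; and binmode STDOUT, ':encoding(UTF-8)'; in place of binmode STDOUT, ':utf8';.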
For the same reason, always use Encode::decode('UTF-8', …), not Encode::decode_utf8(…).
Make decoding fail hard with an exception. Compare these two commands; the second dies before printing "survived":
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
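In a CGI script you would usually want to catch that exception and reject the request rather than let the process die. A minimal sketch, assuming a hypothetical helper named decode_strict that is not part of Encode:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the bytes
# are not valid UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place while checking.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = eval { decode('UTF-8', $bytes, $check) };
    return $chars;
}

my $good = decode_strict("caf\xC3\xA9");   # valid UTF-8
my $bad  = decode_strict("\xC0");          # lone 0xC0 byte is malformed
```

Returning undef is only one possible design; rethrowing with a friendlier message would work just as well.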
You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Your step 2 (decoding CGI input) is written correctly as:
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
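To show what "taking care of surrogate pairs" means, here is the arithmetic sketched in plain Perl. decode_percent_u is a hypothetical helper written for this illustration, not part of URI::Escape::XS, and it assumes an ASCII input string that contains only the non-standard %uXXXX escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: expand %uXXXX escapes, joining surrogate pairs.
sub decode_percent_u {
    my ($s) = @_;

    # A high surrogate (D800-DBFF) followed directly by a low surrogate
    # (DC00-DFFF) encodes a single code point above U+FFFF.
    $s =~ s{%u(D[89AB][0-9A-F]{2})%u(D[C-F][0-9A-F]{2})}{
        chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
    }egi;

    # Remaining escapes outside the surrogate range map directly to one
    # code point. Lone surrogates are left untouched.
    $s =~ s{%u((?!D[89A-F])[0-9A-F]{4})}{ chr(hex($1)) }egi;

    return $s;
}
```

The original pack('U*', hex($1)) line treats each half of a pair as its own character, which is why a tested module is the safer choice in production.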
Do not mess around with the functions from the utf8 module. Its documentation says so. It is intended as a pragma that tells Perl the source code is in UTF-8. If you want to encode or decode, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs code maintenance in case someone later adds non-ASCII string literals. See also CodeLayout::RequireUseUTF8.
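A small sketch of what the pragma changes, assuming the script file itself is saved as UTF-8:

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal is two bytes on disk.
my $as_bytes = do { no utf8;  length "é" };   # literal read as raw bytes
my $as_chars = do { use utf8; length "é" };   # literal read as one character
```

Without the pragma, a non-ASCII literal added later would silently be treated as bytes and could end up double-encoded on output.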
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether the upgrade is intended or needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that a decoding step should have happened earlier. The documents at http://p3rl.org/UNI advise decoding immediately after receiving data from the source. Go over the places where the code reads or writes data and verify that there is a decoding/encoding step, either explicit (decode('UTF-8', …)) or implicit through a layer (use open pragma, binmode, 3-argument form of open).
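The "decode on the way in, encode on the way out" rule can be sketched with explicit calls. The sample bytes stand in for whatever a form field or data file delivers; they are an assumption for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Input boundary: raw bytes as they might arrive from a form field or file.
my $bytes_in = "caf\xC3\xA9";
my $chars    = decode('UTF-8', $bytes_in, $check);

# Inside the program: work on characters only.
my $shouted = uc $chars;

# Output boundary: back to bytes.
my $bytes_out = encode('UTF-8', $shouted, $check);
```

Here length($bytes_in) counts bytes while length($chars) counts characters, which is the difference that implicit upgrades tend to hide.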
For debugging: if you are not sure which string is in a variable, and in which representation, at a given time, you cannot just print it; use the tools Devel::StringInfo and Devel::Peek instead.
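A minimal sketch of the Devel::Peek approach, using two strings that compare equal but are stored differently inside Perl. The sample values are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

my $native  = "\x{e9}";                      # one character, stored as one byte
my $decoded = decode('UTF-8', "\xC3\xA9");   # same character, stored as UTF-8 internally

# Dump writes to STDERR; compare the FLAGS and PV lines of the two dumps.
Dump($native);
Dump($decoded);
```

Printing either variable would not reveal the difference, because what you see then depends on the output layer.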