Checklist for going the Unicode way with Perl
Question
I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.
Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only occur on rare occasions.
This is what I have done so far (forgive me for only including "summary" code examples):
1. Made sure files are always read and written in UTF-8:
    use open ':utf8';
2. Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):
    s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg; # Kept from existing code
    s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg; # Added
    utf8::decode $_;
3. Made sure text is printed as UTF-8:
    binmode STDOUT, ':utf8';
4. Made sure browsers interpret my content as UTF-8:
    Content-Type: text/html; charset=UTF-8
    <meta http-equiv="content-type" content="text/html;charset=UTF-8">
5. Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
    accept-charset="UTF-8"
6. I don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:
    use utf8;
Does this look reasonable, or am I missing something?
I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
Recommended answer
- The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).
For the same reason, always use Encode::decode('UTF-8', …), not Encode::decode_utf8(…).
Make decoding fail hard with an exception. Compare:
    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
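In application code the exception usually needs to be caught rather than left to kill the request. The following is a minimal sketch of that, assuming a helper named decode_strict (a hypothetical name, not part of the original site code or of Encode) that returns undef for malformed input:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (an assumption for illustration): strictly decode
# UTF-8 bytes, returning undef instead of dying on malformed input.
sub decode_strict {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK turns malformed input into an exception;
        # LEAVE_SRC keeps decode() from modifying $bytes in place.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;    # undef if decode() died
}

my $good = decode_strict("caf\xC3\xA9");    # valid UTF-8 bytes
my $bad  = decode_strict("\xC0");           # lone lead byte, not valid UTF-8

print defined $good ? "good input: decoded\n" : "good input: rejected\n";
print defined $bad  ? "bad input: decoded\n"  : "bad input: rejected\n";
```

Whether to return undef, rethrow, or answer with an HTTP 400 is a design choice for the site; the sketch only shows where the failure surfaces.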
- You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Item 2 is written correctly as:
    use Encode qw(decode);
    use URI::Escape::XS qw(decodeURIComponent);
    $_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
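To make the surrogate pair issue concrete, here is a dependency-free sketch of what such a decoder has to do. It is not the URI::Escape::XS implementation; decode_form_value is a hypothetical name, and the sketch assumes the raw value is plain ASCII with %XX and %uXXXX escapes (handling of "+" as a space is left out):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (an assumption for illustration, not the
# URI::Escape::XS code): unescape a form value into a character string.
sub decode_form_value {
    my ($raw) = @_;

    # One pass, so that an unescaped "%" is never unescaped a second time.
    # Every escape is turned into UTF-8 bytes first.
    $raw =~ s{
        %u(D[89AB][0-9A-F]{2})%u(D[C-F][0-9A-F]{2})   # high + low surrogate
      | %u([0-9A-F]{4})                               # other %uXXXX escape
      | %([0-9A-F]{2})                                # plain %XX byte
    }{
        defined $1 ? encode('UTF-8', chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00)))
      : defined $3 ? encode('UTF-8', chr(hex($3)))
      :              chr(hex($4))
    }egix;

    # Strict decode of the collected bytes; dies on malformed UTF-8.
    # Lone surrogates get no special treatment in this sketch.
    return decode('UTF-8', $raw, Encode::FB_CROAK());
}

my $emoji = decode_form_value('%uD83D%uDE00');    # surrogate pair for U+1F600
printf "U+%X\n", ord $emoji;
```

The pair %uD83D%uDE00 combines to the single code point U+1F600, whereas the two-substitution version in the question would produce two separate surrogate code points.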
- Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs code maintenance in case someone later adds non-ASCII string literals. See also the Perl::Critic policy CodeLayout::RequireUseUTF8.
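A short sketch of what the pragma changes, assuming the source file itself is saved as UTF-8 (the variable names are only for illustration):

```perl
use strict;
use warnings;

# With the pragma, the literal is read as one character (U+20AC).
my $with_pragma = do { use utf8; length("€") };

# Without it, Perl sees the three UTF-8 bytes of the same literal.
my $without_pragma = do { no utf8; length("€") };

print "with use utf8: $with_pragma, without: $without_pragma\n";
```

The pragma is lexically scoped, which is why the two do blocks can disagree inside one file.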
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether this is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have had a decoding step earlier. The documents from http://p3rl.org/UNI advise decoding immediately after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, 3-argument form of open).
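The decode-on-input, encode-on-output pattern can be sketched with layers on the 3-argument open. In-memory filehandles stand in for the site's data files here, which is an assumption made only to keep the example self-contained:

```perl
use strict;
use warnings;

# "café\n" as it would sit on disk: UTF-8 encoded bytes.
my $stored_bytes = "caf\xC3\xA9\n";

# Input boundary: the layer decodes bytes into characters.
open(my $in, '<:encoding(UTF-8)', \$stored_bytes) or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;

# Inside the program the string has character semantics.
my $char_count = length $line;

# Output boundary: the layer encodes characters back into bytes.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "open for write: $!";
print {$out} uc($line), "\n";
close $out;

print "characters read: $char_count, bytes written: ", length($out_bytes), "\n";
```

With a real file, only the third argument to open changes; the layers stay the same.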
For debugging: If you are not sure what string is in a variable in which representation at a certain time, you cannot just print it; use the tools Devel::StringInfo and Devel::Peek instead.
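As a rough illustration of that kind of inspection, the sketch below uses Devel::Peek, which ships with Perl, alongside a hypothetical helper named codepoints that is not part of either module:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

# Hypothetical helper (an assumption for illustration): list the code
# points Perl sees in a string, which plain print would hide.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $bytes = "caf\xC3\xA9";              # still encoded: 5 bytes
my $text  = decode('UTF-8', $bytes);    # decoded: 4 characters

print codepoints($bytes), "\n";
print codepoints($text),  "\n";

# Dump writes the internal representation, including flags, to STDERR.
Dump($text);
```

The undecoded string shows the two bytes U+00C3 U+00A9 where the decoded one shows the single character U+00E9, which is the difference a plain print to a terminal tends to mask.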