How do I sanitize invalid UTF-8 in Perl?
Question
My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.
At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).
I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.
Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that doesn't require an encoded I/O layer, as that will just leave me with an error message and no way to sanitize the raw octets.
I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
Recommended answer
You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
"\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.
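The U+FFFE value can be checked by hand: a three-byte UTF-8 sequence carries 4 + 6 + 6 payload bits. A minimal sketch of that arithmetic follows; decode_3byte is a hypothetical helper name, and it assumes its input really is a well-formed three-byte sequence (it does no validation).

```perl
use strict;
use warnings;

# Hypothetical helper: combine the payload bits of a three-byte
# UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a codepoint.
# No validation is performed; the caller must pass exactly such a sequence.
sub decode_3byte {
    my ($octets) = @_;
    my ($b1, $b2, $b3) = unpack 'C3', $octets;
    return (($b1 & 0x0F) << 12)    # low 4 bits of the lead byte
         | (($b2 & 0x3F) << 6)     # low 6 bits of the first continuation byte
         |  ($b3 & 0x3F);          # low 6 bits of the second continuation byte
}

printf "U+%04X\n", decode_3byte("\xEF\xBF\xBE");   # prints U+FFFE
```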
Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
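As a rough sketch of that advice, the following uses Encode's decode with the strict UTF-8 encoding on raw octets, with no I/O layer involved. The sub names is_strict_utf8 and sanitize_utf8 are hypothetical, and the choice to drop bad bytes rather than substitute U+FFFD is an assumption about what the consumer wants. Which codepoints the strict decoder rejects (U+FFFE in particular) can vary with the Perl/Encode version, so treat this as a starting point rather than a guarantee.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the octets decode cleanly as strict UTF-8.
# FB_CROAK makes decode die on the first problem; LEAVE_SRC keeps
# decode from modifying the input buffer.
sub is_strict_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: drop whatever the strict decoder rejects and
# return the remaining text re-encoded as UTF-8 octets.
# The coderef CHECK is called for each offending item; returning ''
# removes it instead of inserting a substitution character.
sub sanitize_utf8 {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets, sub { '' });
    return encode('UTF-8', $text);
}
```

A caller could test the input with is_strict_utf8 first and only fall back to sanitize_utf8 (and perhaps log the file name) when the check fails.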
Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
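A manual filter along those lines might look like the sketch below. It works on already-decoded text (characters, not octets). The list of codepoints is an assumption: it covers only the BMP noncharacters U+FFFE, U+FFFF and U+FDD0..U+FDEF, not a verified list of what Sphinx rejects, and strip_problem_codepoints is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: delete an assumed list of problem codepoints
# from decoded text. Extend the tr list as the consumer reports more.
sub strip_problem_codepoints {
    my ($text) = @_;
    # Older perls may warn about noncharacters in source literals,
    # so silence the utf8 warning category for this block only.
    no warnings 'utf8';
    $text =~ tr/\x{FFFE}\x{FFFF}\x{FDD0}-\x{FDEF}//d;
    return $text;
}
```

Because tr counts the characters it matches, the same list without the /d flag can be used to detect the problem without modifying the data.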