How do I sanitize invalid UTF-8 in Perl?
Question
My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.
At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).
I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.
Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that doesn't require an encoded I/O layer, as that will just leave me with an error message and no way to sanitize the raw octets.
I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
Recommended answer
You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
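To see which of the two encodings a given name selects, you can ask Encode directly. This is a small sketch, not part of the original answer; it relies only on find_encoding and the name method, which the Encode documentation describes, and the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Encoding names are matched case-insensitively; the hyphen is what
# selects the strict variant.
my $lax    = find_encoding('utf8');    # Perl's permissive native encoding
my $strict = find_encoding('UTF-8');   # the strict, interchange-safe encoding

printf "utf8  resolves to %s\n", $lax->name;
printf "UTF-8 resolves to %s\n", $strict->name;
```

According to the Encode documentation, the first name resolves to utf8 and the second to utf-8-strict.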
"\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.
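The arithmetic behind that decoding can be checked by hand, which also shows where the \xFFFE in the error message comes from. The sketch below is illustrative only: decode_three_byte_sequence is a made-up helper name, and it assumes its input is exactly one well-formed three-byte sequence (1110xxxx 10yyyyyy 10zzzzzz) with no validation.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a single 3-byte UTF-8 sequence by hand.
# Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxx yyyyyy zzzzzz (16 bits).
sub decode_three_byte_sequence {
    my ($octets) = @_;
    my ($b1, $b2, $b3) = unpack 'C3', $octets;
    return (($b1 & 0x0F) << 12)    # low 4 bits of the lead byte
         | (($b2 & 0x3F) << 6)     # low 6 bits of the first continuation byte
         |  ($b3 & 0x3F);          # low 6 bits of the second continuation byte
}

my $codepoint = decode_three_byte_sequence("\xEF\xBF\xBE");
printf "U+%04X\n", $codepoint;     # prints U+FFFE
```

EF contributes 0xF000, BF contributes 0x0FC0 and BE contributes 0x003E, which sum to 0xFFFE.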
Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
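As a rough sketch of that suggestion, the check below decodes with the strict encoding and FB_CROAK inside an eval, so no I/O layer is involved and the raw octets stay available for cleanup. The name is_strict_utf8 is hypothetical. Which borderline codepoints (such as U+FFFE) are rejected depends on the Perl and Encode version, so treat the result for those as something to verify on your own installation.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the octets decode under strict UTF-8.
# FB_CROAK makes decode() die on bad input; LEAVE_SRC keeps it from
# modifying the buffer that was passed in.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

for my $sample ("caf\xC3\xA9", "\xC3\x28", "\xFF") {
    printf "%-12s %s\n",
        join(' ', map { sprintf '%02X', $_ } unpack 'C*', $sample),
        is_strict_utf8($sample) ? 'ok' : 'invalid';
}
```

On failure, $@ holds Encode's message, which names the offending bytes and can be logged before the record is skipped or repaired.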
Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
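A minimal sketch of that manual filter follows. It assumes the reject list is just U+FFFE and U+FFFF; the real list would be whatever Sphinx turns out to refuse. The name strip_rejected_codepoints is made up for illustration. It decodes with the lax utf8 encoding so the offending codepoints survive as characters, deletes them with tr, then re-encodes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: remove codepoints on a reject list from
# nominally UTF-8 octets. The list here (U+FFFE, U+FFFF) is an
# assumption; extend it as the downstream consumer complains.
sub strip_rejected_codepoints {
    my ($octets) = @_;
    # Lax decoding accepts these codepoints; with no CHECK argument,
    # malformed bytes are replaced with U+FFFD rather than causing an error.
    my $text = decode('utf8', $octets);
    # Delete every character on the reject list.
    $text =~ tr/\x{FFFE}\x{FFFF}//d;
    return encode('utf8', $text);
}

my $clean = strip_rejected_codepoints("a\xEF\xBF\xBEb");
print "$clean\n";    # prints ab
```

Because the filtering happens on decoded characters rather than raw bytes, the reject list can be written as codepoints and does not need to account for multi-byte boundaries.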