How do I sanitize invalid UTF-8 in Perl?

Question

My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.

At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).

I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.

Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that didn't require an encoded I/O layer, as that will just leave me with an error message and no way to sanitize the raw octets.

I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?

Recommended answer

You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.

"\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.

Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
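
As a rough sketch of that suggestion, the snippet below uses Encode's decode with the strict encoding name 'UTF-8' in two ways: with FB_CROAK to detect bad input, and with the default CHECK to replace each malformed byte with U+FFFD. The helper names is_strict_utf8 and replace_malformed are made up for illustration. Whether noncharacters such as U+FFFE are rejected can depend on the Perl/Encode version, so this is not a guarantee for that particular sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the octets decode cleanly under strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        # FB_CROAK makes decode die on malformed input;
        # LEAVE_SRC keeps decode from modifying $octets in place.
        decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: returns strict UTF-8 octets, with each malformed
# byte replaced by U+FFFD (the default behaviour when CHECK is omitted).
sub replace_malformed {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    return encode('UTF-8', $chars);
}

print is_strict_utf8("caf\xC3\xA9") ? "ok\n" : "bad\n";
print is_strict_utf8("abc\x80def")  ? "ok\n" : "bad\n";
```

Because both helpers work on a plain string of octets, they can be applied to data read from a raw filehandle, without an :encoding layer.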

Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
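
A minimal sketch of such a manual filter follows. It decodes with the lax utf8 encoding so that the codepoints survive decoding, deletes them with tr, and re-encodes. The helper name strip_noncharacters and the particular list (U+FDD0..U+FDEF, U+FFFE, U+FFFF) are assumptions for illustration; the real list would be whatever Sphinx turns out to reject.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: drop a hand-picked set of codepoints from
# nominally-UTF-8 octets and return the cleaned octets.
sub strip_noncharacters {
    my ($octets) = @_;

    # Lax decode: 'utf8' accepts codepoints such as U+FFFE.
    my $chars = decode('utf8', $octets);

    {
        # Some older perls warn when these codepoints appear in source.
        no warnings 'utf8';
        # Assumed list: BMP noncharacters. Extend as needed.
        $chars =~ tr/\x{FDD0}-\x{FDEF}\x{FFFE}\x{FFFF}//d;
    }

    return encode('utf8', $chars);
}

my $clean = strip_noncharacters("a\xEF\xBF\xBEb");
print length($clean), "\n";
```

Bytes that are malformed at the encoding level (rather than well-formed encodings of unwanted codepoints) are a separate problem and would still need one of the Encode CHECK modes.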
