How do I sanitize invalid UTF-8 in Perl?

Question

My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.

At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).

I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. Perl 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.

Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that doesn't require an encoded I/O layer, as that would just leave me with an error message and no way to sanitize the raw octets.

I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?

Recommended answer

You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
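As a quick check of which encoding a given name selects, the following minimal sketch asks Encode to resolve both spellings. It assumes only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Encoding names are matched case-insensitively; it is the hyphen
# that selects the strict variant rather than the lax one.
my $lax    = find_encoding('utf8')->name;    # Perl's lax, native flavour
my $strict = find_encoding('UTF-8')->name;   # the strict flavour

print "utf8  resolves to $lax\n";
print "UTF-8 resolves to $strict\n";
```

Checking the resolved name this way can help confirm which variant a layer such as :encoding(...) or a call to decode will actually use.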

"\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.

Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
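A minimal sketch of that suggestion, working on raw octets without an I/O layer. The helper names is_strict_utf8 and scrub_utf8 are hypothetical, not part of Encode. The first uses FB_CROAK so that bad input raises an exception; the second relies on the default CHECK behaviour, which substitutes U+FFFD for malformed data. Which codepoints the strict decoder rejects (U+FFFE in particular) may differ between Encode versions, so the sketch only demonstrates clearly malformed bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets decodes cleanly as strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        # FB_CROAK dies on bad input; LEAVE_SRC keeps decode from
        # modifying the octets it is given.
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: round-trip through the strict decoder with the
# default CHECK, so malformed data is replaced by U+FFFD.
sub scrub_utf8 {
    my ($octets) = @_;
    my $text = Encode::decode('UTF-8', $octets);
    return Encode::encode('UTF-8', $text);
}

my $bad = "abc\xFFdef";    # \xFF can never appear in UTF-8
print "valid before scrub: ", is_strict_utf8($bad), "\n";
print "valid after scrub:  ", is_strict_utf8(scrub_utf8($bad)), "\n";
```

Detection and repair are kept separate here, so a caller can choose between rejecting a document outright and passing on a scrubbed copy.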

Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
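The manual filter might look like the sketch below. The helper name strip_noncharacters and the list of codepoints are assumptions for illustration: it removes the BMP noncharacters U+FDD0..U+FDEF, U+FFFE and U+FFFF, and the list would need adjusting to whatever Sphinx actually rejects.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode leniently, delete selected codepoints,
# then re-encode. The tr/// list is an illustrative assumption.
sub strip_noncharacters {
    my ($octets) = @_;
    no warnings 'utf8';    # some perls may warn about noncharacter literals

    # The lax 'utf8' decoder accepts codepoints such as U+FFFE.
    my $text = Encode::decode('utf8', $octets);

    # Delete the BMP noncharacters from the decoded string.
    $text =~ tr/\x{FDD0}-\x{FDEF}\x{FFFE}\x{FFFF}//d;

    return Encode::encode('UTF-8', $text);
}

my $clean = strip_noncharacters("a\xEF\xBF\xBEb");
print "cleaned length: ", length($clean), "\n";
```

Because the filter runs on the decoded string rather than an I/O layer, it works the same way whether or not a given Perl version's strict decoder complains about these codepoints.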
