How do I sanitize invalid UTF-8 in Perl?

Posted: 2021/12/10 18:43:50. Tags: perl, utf-8, sanitization


Question

My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output so that no invalid UTF-8 octets are emitted; otherwise the downstream consumer (Sphinx) will blow up.

At the very least I would like to know whether the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).

I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. 5.12.3 and earlier are perfectly happy to read and later write that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least a complaint is consistent with what Sphinx does.
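On where the \xFFFE comes from: the following is a small sketch that decodes the three octets by hand. EF BF BE is a structurally ordinary three-byte UTF-8 sequence whose payload bits spell U+FFFE, a code point that Unicode reserves as a noncharacter (it is the byte-swapped form of the BOM, U+FEFF). That is presumably why the stricter layer rejects it even though the bytes themselves look like normal UTF-8.

```perl
use strict;
use warnings;

# The three octets from the question, as raw byte values.
my @octets = (0xEF, 0xBF, 0xBE);

# A three-byte UTF-8 sequence has the bit layout
#   1110xxxx 10xxxxxx 10xxxxxx
# so strip the marker bits and concatenate the payload bits.
my $code_point = (($octets[0] & 0x0F) << 12)
               | (($octets[1] & 0x3F) << 6)
               |  ($octets[2] & 0x3F);

printf "U+%04X\n", $code_point;   # prints U+FFFE
```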

Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that doesn't require an encoding I/O layer, as that just leaves me with an error message and no way to sanitize the raw octets.
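One way to check raw octets without an I/O layer is to call Encode's strict decoder inside an eval. The sketch below assumes the octets were read with a :raw handle; decodes_strictly is a hypothetical helper name, not part of Encode. As far as I know, decode_utf8 goes through Encode's lax "utf8" encoding, whereas the "UTF-8" name selects the strict one. Whether the strict decoder also rejects noncharacters such as U+FFFE appears to depend on the perl and Encode versions (the question itself sees different behaviour between 5.12 and 5.14), so treat this only as a check for structurally malformed bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if Encode's strict UTF-8 decoder
# accepts the octets, 0 if it croaks on them.
sub decodes_strictly {
    my ($octets) = @_;

    # decode() with a CHECK value may modify its input, so work on a copy.
    my $copy = $octets;

    my $ok = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK);
        1;
    };
    return $ok ? 1 : 0;
}

print decodes_strictly("plain ascii"), "\n";
print decodes_strictly("a\xFEb"), "\n";   # 0xFE never occurs in UTF-8
```

This only reports that something is wrong; it does not tell you which octets to drop.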

I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl older than 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
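A version-independent approach is to validate the raw octets with a regular expression, since that does not depend on what a particular perl's decoder considers strict. The following is a minimal sketch, not a definitive routine: sanitize_utf8 and is_strict_utf8 are hypothetical names, the byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences, and dropping noncharacters (U+FFFE and friends) is a policy assumption made here because the downstream consumer reportedly chokes on them.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence (no overlongs, no surrogates,
# nothing above U+10FFFF).
my $WELL_FORMED = qr/
    [\x00-\x7F]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Encoded noncharacters: U+FDD0..U+FDEF, U+FFFE, U+FFFF, and the
# last two code points of every supplementary plane.
my $NONCHAR = qr/
    \xEF\xB7[\x90-\xAF]
  | \xEF\xBF[\xBE\xBF]
  | [\xF0-\xF4][\x8F\x9F\xAF\xBF]\xBF[\xBE\xBF]
/x;

# Keep well-formed sequences, drop noncharacters, and skip any
# octet that does not start a well-formed sequence.
sub sanitize_utf8 {
    my ($octets) = @_;
    my $clean = '';
    pos($octets) = 0;
    while (pos($octets) < length($octets)) {
        if ($octets =~ /\G($WELL_FORMED)/gc) {
            my $seq = $1;
            $clean .= $seq unless $seq =~ /\A$NONCHAR\z/;
        }
        else {
            # Offending octet: step over it and resynchronise.
            pos($octets) = pos($octets) + 1;
        }
    }
    return $clean;
}

# The data is acceptable exactly when sanitizing changes nothing.
sub is_strict_utf8 {
    my ($octets) = @_;
    return sanitize_utf8($octets) eq $octets ? 1 : 0;
}

print is_strict_utf8("\xEF\xBF\xBE"), "\n";         # prints 0
print sanitize_utf8("ok\xEF\xBF\xBE\xFEgo"), "\n";  # prints okgo
```

The loop walks one sequence at a time rather than using a single (?:...)* match over the whole string, which sidesteps the repetition limits that older perls impose on complex quantified groups. The input should be read as bytes (for example from a handle opened with :raw), and the sanitized octets can then be written out unchanged.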
