How do I sanitize invalid UTF-8 in Perl?

Views: 76 | Published: 2020/5/25 18:48:11 | Tags: perl utf-8 sanitization

Problem description

My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.

At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).

I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.

Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that didn't require an encoded I/O layer, as that will just leave me with an error message and no way to sanitize the raw octets.
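One way to detect bad data without an encoding layer is to match the raw octets against the well-formed UTF-8 grammar from RFC 3629, then separately reject encoded noncharacters such as U+FFFE (which is what the octets EF BF BE decode to). The sketch below is only an illustration under assumptions: the names is_strict_utf8, $WELL_FORMED and $NONCHAR are hypothetical and not part of any module, and treating noncharacters as invalid is a guess at what the downstream consumer requires. Because it is a plain byte-level regex, it should not depend on which Encode version ships with the perl in use.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character, expressed as raw octets (RFC 3629 grammar):
# no overlong forms, no surrogates, nothing above U+10FFFF.
my $WELL_FORMED = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Encoded noncharacters (assumed unwanted here): U+FDD0..U+FDEF and the
# last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
my $NONCHAR = qr/
      \xEF\xB7[\x90-\xAF]
    | \xEF\xBF[\xBE\xBF]
    | [\xF0-\xF4][\x8F\x9F\xAF\xBF]\xBF[\xBE\xBF]
/x;

# Returns 1 if $octets is well-formed UTF-8 with no noncharacters, else 0.
# Expects a byte string, e.g. data read through a :raw handle.
sub is_strict_utf8 {
    my ($octets) = @_;
    return 0 unless $octets =~ /\A(?:$WELL_FORMED)*\z/;
    return 0 if $octets =~ /$NONCHAR/;
    return 1;
}
```

A caller could read each record with binmode set to :raw and skip or log it when is_strict_utf8 returns 0, which leaves the original octets available for later cleanup.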

I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
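As a possible starting point for the substitution routine asked about here, the raw octets can be walked one character at a time, keeping every well-formed UTF-8 character that is not a noncharacter and dropping every other byte. This is a hedged sketch, not a definitive implementation: sanitize_utf8, $WELL_FORMED and $NONCHAR are hypothetical names, and silently deleting bad bytes (rather than substituting U+FFFD) is an assumption based on the stated wish to remove just the offending bytes.

```perl
use strict;
use warnings;

# One well-formed UTF-8 character as raw octets (RFC 3629 grammar).
my $WELL_FORMED = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Encoded noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in every plane.
my $NONCHAR = qr/
      \xEF\xB7[\x90-\xAF]
    | \xEF\xBF[\xBE\xBF]
    | [\xF0-\xF4][\x8F\x9F\xAF\xBF]\xBF[\xBE\xBF]
/x;

# Returns a copy of $octets containing only acceptable characters.
# Expects a byte string, e.g. data read through a :raw handle.
sub sanitize_utf8 {
    my ($octets) = @_;
    my $clean = '';
    # At each position, either capture one acceptable character in $1,
    # or consume a single offending byte with "." and discard it.
    while ($octets =~ /\G(?:((?!$NONCHAR)$WELL_FORMED)|.)/gs) {
        $clean .= $1 if defined $1;
    }
    return $clean;
}
```

For example, sanitize_utf8("a\xEF\xBF\xBEb") yields "ab", while valid input such as "caf\xC3\xA9" passes through unchanged. Comparing the length of the result with the length of the input also tells the caller whether anything was dropped.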
