How to check whether a file is valid UTF-8?

Question

I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a stage of pre-validating the data for UTF-8 well-formedness, but I've not yet found a utility to help do this.

There's a web service at W3C which appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't report which lines/characters to fix.

I'd be happy with either a tool I can drop in and use (ideally cross-platform) or a Ruby/Perl script I can make part of my data loading process.

Recommended answer

You can use GNU iconv:

$ iconv -f UTF-8 your_file -o /dev/null; echo $?

Or with older versions of iconv, such as on macOS:

$ iconv -f UTF-8 your_file > /dev/null; echo $?

The command exits with status 0 if the file converts successfully and 1 if it does not. On failure, it also prints the byte offset at which the invalid byte sequence occurred.

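Edit: The output encoding doesn't have to be specified. When it is omitted, iconv uses the current locale's encoding, which is assumed here to be UTF-8; pass -t UTF-8 explicitly if the locale may differ.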
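
Since the question also asks for a script that can run as part of a data loading process and report where the problem is, here is a minimal sketch of the same check in Python (not one of the languages the asker named, so treat it as an illustration rather than a drop-in). It relies on Python's strict UTF-8 decoder; the function names find_invalid_utf8 and main are hypothetical, and the whole file is read into memory, which assumes the files are of modest size.

```python
import sys


def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset, reason) for the first invalid
    sequence in data, or None if data is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the 0-based offset of the first offending byte.
        # Lines are counted by the newlines that precede that offset.
        line_number = data.count(b"\n", 0, err.start) + 1
        return line_number, err.start, err.reason
    return None


def main(paths):
    """Check each path; return 1 if any file is invalid, else 0."""
    status = 0
    for path in paths:
        with open(path, "rb") as handle:
            problem = find_invalid_utf8(handle.read())
        if problem is not None:
            line_number, offset, reason = problem
            print(
                f"{path}: line {line_number}, byte offset {offset}: {reason}",
                file=sys.stderr,
            )
            status = 1
    return status


# Only act as a command-line tool when file paths were actually given.
if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1:]))
```

Saved under a name of your choosing and run with one or more file paths, it mirrors the iconv convention above: exit status 0 when every file decodes, 1 otherwise. It reports only the first invalid sequence per file, so a file with several problems needs repeated runs after each fix.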
