How to check whether a file is valid UTF-8?
Question
I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a stage of pre-validating the data for UTF-8 well-formedness, but I've not yet found a utility to help do this.
There's a web service at W3C which appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't report which lines/characters to fix.
I'd be happy with either a tool I can drop in and use (ideally cross-platform), or a ruby/perl script I can make part of my data loading process.
Recommended answer
You can use GNU iconv:
$ iconv -f UTF-8 your_file -o /dev/null; echo $?
Or with older versions of iconv, such as on macOS:
$ iconv -f UTF-8 your_file > /dev/null; echo $?
The command returns 0 if the file could be converted successfully and 1 if not. On failure it also prints an error message on standard error giving the byte offset at which the invalid byte sequence occurred.
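Edit: The output encoding doesn't have to be specified. When -t is omitted, iconv uses the current locale's encoding, which is UTF-8 on most modern systems. If your locale is not UTF-8, pass -t UTF-8 explicitly so that valid non-ASCII input is not rejected.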
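For the pre-validation stage mentioned in the question, the iconv check can be wrapped in a small shell function and run over several files. This is only a minimal sketch, assuming a POSIX shell and an iconv on the PATH. The names is_valid_utf8 and check_utf8_files are hypothetical helpers, not part of iconv, and -t UTF-8 is passed explicitly so the result does not depend on the locale.

```shell
# Hypothetical helper: succeeds (exit status 0) only if iconv can decode
# the given file as UTF-8. iconv's own diagnostic goes to standard error.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null
}

# Hypothetical helper: checks each file passed as an argument, prints one
# verdict per file, and returns 1 if any file failed the check.
check_utf8_files() {
    status=0
    for f in "$@"; do
        if is_valid_utf8 "$f"; then
            echo "OK      $f"
        else
            echo "INVALID $f"
            status=1
        fi
    done
    return "$status"
}
```

A loader script could then call something like check_utf8_files file1 file2 || exit 1 (the file names here are placeholders) before handing the data to the parser.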