How to check whether a file is valid UTF-8?


Question

I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a pre-validation stage that checks the data for UTF-8 well-formedness, but I haven't yet found a utility that does this.

There's a web service at W3C, but it appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't say which lines or characters need fixing.

I'd be happy with either a tool I can drop in and use (ideally cross-platform), or a Ruby/Perl script I can make part of my data loading process.

Recommended answer

You can use GNU iconv:

$ iconv -f UTF-8 your_file -o /dev/null; echo $?

Or with older versions of iconv, such as on macOS:

$ iconv -f UTF-8 your_file > /dev/null; echo $?

The command exits with status 0 if the file could be converted successfully, and 1 if not. On failure it also prints the byte offset at which the invalid byte sequence occurred.
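To use this as a pre-validation stage, the exit status can be wrapped in a small shell function. The following is only a minimal sketch, not part of the original answer: the names is_valid_utf8 and validate_all are hypothetical, and it assumes an iconv that accepts both -f and -t.

```shell
#!/bin/sh
# Sketch: pre-validate files as UTF-8 before handing them to a parser.
# is_valid_utf8 and validate_all are hypothetical helper names.

# Succeeds (exit status 0) if the file converts cleanly from UTF-8.
# Both encodings are given explicitly so the result does not depend
# on the current locale. Also fails if the file cannot be read.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Checks every file passed as an argument, reports each failing file
# on stderr, and returns non-zero if any file failed.
validate_all() {
    status=0
    for f in "$@"; do
        if ! is_valid_utf8 "$f"; then
            echo "invalid UTF-8: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A loading script could then call something like `validate_all data/*.txt || exit 1` before invoking the parser (the data/ path is a placeholder). The sketch discards iconv's own message; dropping `2>&1` from is_valid_utf8 lets the reported byte offset through.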

Edit: The output encoding doesn't have to be specified; iconv then falls back to the encoding of the current locale, which is UTF-8 on most modern systems. Passing -t UTF-8 explicitly removes that dependency.
