How to check whether a file is valid UTF-8?

Views: 334 Published: 2020/7/13 2:32:44 Tags: validation utf-8 internationalization

Problem description

I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a stage of pre-validating the data for UTF-8 well-formedness, but I've not yet found a utility to help do this.

There's a web service at W3C which appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't report which lines/characters to fix.

I'd be happy with either a tool I can drop in and use (ideally cross-platform), or a ruby/perl script I can make part of my data loading process.

Recommended answer

You can use GNU iconv:

$ iconv -f UTF-8 your_file -o /dev/null; echo $?

Or with older versions of iconv, such as on macOS:

$ iconv -f UTF-8 your_file > /dev/null; echo $?

The command exits with status 0 if the file could be converted successfully, and 1 if not. On failure, it also prints the byte offset where the invalid byte sequence occurred.

Edit: The output encoding doesn't have to be specified. When -t is omitted, iconv uses the encoding of the current locale, which is UTF-8 on most modern systems.
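
If iconv is not available, or the offending line numbers are needed rather than a single byte offset, the same check can be sketched in Python. This is not part of the original answer: it is a minimal sketch that assumes lines are separated by newline bytes, and the function name find_invalid_utf8 is hypothetical. It relies on Python's strict UTF-8 decoder and reports only the first bad byte on each line.

```python
def find_invalid_utf8(path):
    """Return (line_number, byte_column, reason) for each line that is not valid UTF-8.

    Hypothetical helper for illustration. Line numbers and byte columns are
    1-based, and only the first invalid sequence on a line is reported.
    """
    problems = []
    # Read raw bytes so nothing is decoded implicitly. Splitting on b"\n" is
    # safe because no UTF-8 multi-byte sequence contains the byte 0x0A.
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")  # strict decoding raises on bad bytes
            except UnicodeDecodeError as error:
                problems.append((line_number, error.start + 1, error.reason))
    return problems


def main(paths):
    """Print one line per problem and return 0 if every file is valid, else 1."""
    status = 0
    for path in paths:
        for line_number, column, reason in find_invalid_utf8(path):
            print(f"{path}:{line_number}: byte {column}: {reason}")
            status = 1
    return status
```

To use it as a pre-validation stage, call main with the list of data files and pass its return value to sys.exit, which mirrors the 0/1 exit status of the iconv commands above.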
