Get file's encoding in Java


Question

Possible duplicate:
Java: How to determine the correct charset encoding of a stream

A user uploads a CSV file to the server, and the server needs to check whether the file is encoded as UTF-8. If it is not, the server needs to inform the user that (s)he uploaded a file with the wrong encoding. The problem is how to detect whether the uploaded file is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?

Recommended answer

At least in the general case, there's no way to be certain which encoding a file uses; the best you can do is make a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing the field without confirming any one candidate. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I say "almost" only out of caution, not because I'm certain any possibility can be eliminated).
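To illustrate why ISO 8859 cannot be ruled out, here is a minimal sketch using ISO-8859-1, which in Java maps every byte value to a character. The class and method names (`Latin1Ambiguity`, `asLatin1`) are made up for this example. It takes bytes that were produced as UTF-8 and reads them back as ISO-8859-1 without any error being raised:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class Latin1Ambiguity {

    /** Decodes bytes as ISO-8859-1. Every byte value maps to a character, so this never fails. */
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
        byte[] utf8Bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Read as ISO-8859-1, the same bytes become two characters (U+00C3, U+00A9).
        String misread = asLatin1(utf8Bytes);
        System.out.println(misread.length());

        // Re-encoding gives back the original bytes, so nothing signals the misreading.
        byte[] roundTrip = misread.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(utf8Bytes, roundTrip));
    }
}
```

The decoder accepts the input either way, so a successful ISO-8859-1 decode says nothing about what the file's real encoding is.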

You can, however, make some reasonable guesses. For example, if a file starts with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it really is UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (In properly encoded UTF-8, 110xxxxx is the lead byte of a two-byte sequence and must be followed by a continuation byte of the form 10xxxxxx, not by another lead byte.)
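The two checks described above could be sketched in Java roughly as follows. This is one possible approach, not the only one: rather than inspecting bit patterns by hand, it lets the JDK's UTF-8 `CharsetDecoder` do the work, configured with `CodingErrorAction.REPORT` so that malformed input throws instead of being silently replaced. The class and method names (`Utf8Check`, `hasUtf8Bom`, `isWellFormedUtf8`) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Check {

    /** True if the data begins with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** True if the bytes are well-formed UTF-8; false if any malformed sequence is found. */
    static boolean isWellFormedUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC3, (byte) 0xA9};    // 110xxxxx 10xxxxxx
        byte[] invalid = {(byte) 0xC3, (byte) 0xC3};  // 110xxxxx 110xxxxx
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};

        System.out.println(isWellFormedUtf8(valid));
        System.out.println(isWellFormedUtf8(invalid));
        System.out.println(hasUtf8Bom(withBom));
    }
}
```

For an uploaded file, the bytes could come from something like `Files.readAllBytes(path)`. A `false` result rules UTF-8 out. A `true` result only means the bytes are consistent with UTF-8, and plain ASCII content passes as well, so it remains a guess in the sense described above.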
