Get file's encoding in Java


Question


Possible duplicate:

Java: How to determine the correct charset encoding of a stream

A user uploads a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. If it is not, the server needs to inform the user that (s)he uploaded a file with the wrong encoding. The problem is how to detect whether the uploaded file is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?

Recommended answer

At least in the general case, there's no way to be certain which encoding a file uses; the best you can do is make a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the candidates without confirming any one of them. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not out of any certainty that you could eliminate any of the possibilities).
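To illustrate why an ISO 8859 encoding can rarely be ruled out, here is a minimal sketch using ISO-8859-1 as one example variant. The class and method names are hypothetical, chosen only for this illustration. A strict decoder accepts an array holding every possible byte value, so a successful decode tells you nothing about whether the file was really written in that encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; names are illustrative, not from the original answer. */
final class Latin1AcceptsAnything {

    /** Returns true if a strict ISO-8859-1 decoder accepts the given bytes. */
    static boolean decodesAsLatin1(byte[] data) {
        try {
            StandardCharsets.ISO_8859_1.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Builds an array that contains every possible byte value exactly once. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }
}
```

Calling `Latin1AcceptsAnything.decodesAsLatin1(Latin1AcceptsAnything.allByteValues())` succeeds, which is why a check of this kind cannot confirm the encoding.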

You can, however, make some reasonable guesses. For example, if a file starts with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it really is UTF-8. Likewise, if you see byte sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded as UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is the lead byte of a two-byte sequence, which must be followed by a continuation byte of the form 10xxxxxx, not by another lead byte, in properly encoded UTF-8.)
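One way to apply these checks in Java is sketched below. It is only a sketch: the class and method names are hypothetical, and instead of inspecting bit patterns by hand it relies on the standard `CharsetDecoder` configured to report errors, which rejects malformed sequences such as 110xxxxx 110xxxxx. It assumes the uploaded file has already been read into a byte array. A result of `true` means only that the bytes are not invalid UTF-8; as noted above, it does not prove the file was written as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; names are illustrative, not from the original answer. */
final class Utf8Check {

    /** Returns true if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /**
     * Returns true if the bytes decode as UTF-8 without error.
     * A false result rules UTF-8 out; a true result is only a heuristic hint.
     */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For the upload scenario in the question, the server could read the file with `java.nio.file.Files.readAllBytes` and reject it when `Utf8Check.isValidUtf8` returns `false`. Pure ASCII content passes this check too, since ASCII is a subset of UTF-8.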
