Java Text File Encoding

Question

I have a text file that can be ANSI (with the ISO-8859-2 charset), UTF-8, or UCS-2 big- or little-endian.

Is there any way to detect the file's encoding so that I can read it properly?

Or is it possible to read a file without specifying the encoding, so that the file is read as it is?

(There are several programs that can detect and convert the encoding/format of text files.)

Recommended answer

UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If one is present, it's a pretty good bet that the file is in that encoding, but it's not a dead certainty. You may well also find that the file is in one of those encodings but doesn't have a byte order mark.
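
As a rough illustration of the byte order mark check, here is a minimal sketch. The class name `BomSniffer` and the method `detect` are made up for this example. It assumes only the encodings from the question are in play, so it does not try to tell a UTF-16LE mark (FF FE) apart from a UTF-32LE mark (FF FE 00 00).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: inspects the first bytes of a file for a byte order mark.
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the charset implied by a leading byte order mark,
     * or null when no known mark is present (which proves nothing either way).
     */
    static Charset detect(byte[] head) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            // UTF-16 big-endian BOM: FE FF
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            // UTF-16 little-endian BOM: FF FE
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return null;
    }
}
```

A null result only means that no mark was found. The file may still be in any of the candidate encodings, and the mark itself, when found, still has to be skipped before decoding the rest of the bytes.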

I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page about it suggests that only byte 0x7f is invalid.
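
One possible heuristic for files without a byte order mark, sketched here as an assumption rather than as the answer's own method: try a strict UTF-8 decode first, since random non-ASCII ISO-8859-2 text is rarely valid UTF-8, and fall back to ISO-8859-2 when that fails. The names `EncodingGuesser`, `isValidUtf8` and `guess` are invented for the example, and UTF-16 without a byte order mark is deliberately not handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic: "valid UTF-8, else ISO-8859-2".
final class EncodingGuesser {
    private EncodingGuesser() {}

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting U+FFFD for bad input.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** A guess only: pure ASCII input is reported as UTF-8, which decodes identically. */
    static Charset guess(byte[] data) {
        if (isValidUtf8(data)) {
            return StandardCharsets.UTF_8;
        }
        // ISO-8859-2 is not one of the charsets every JVM must support,
        // so forName may throw on an unusual runtime.
        return Charset.forName("ISO-8859-2");
    }
}
```

This is a guess, not a detection: a short ISO-8859-2 file can happen to be valid UTF-8, and the fallback accepts essentially any byte sequence, which is the point the answer makes about that encoding.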

There's no such thing as reading a file "as it is" and still getting text out: a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
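
To make the bytes-versus-characters point concrete, here is a small sketch that reads a file as raw bytes and then decodes it with an explicitly chosen charset. `DecodeDemo`, `decode` and `readText` are illustrative names, not part of any library.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo: the same bytes become different text under different charsets.
final class DecodeDemo {
    private DecodeDemo() {}

    /** Decodes raw bytes with the given charset; no charset is guessed here. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /** Reads the whole file as bytes, then applies the caller's charset. */
    static String readText(Path path, Charset charset) throws IOException {
        byte[] raw = Files.readAllBytes(path);
        return decode(raw, charset);
    }
}
```

For example, the two bytes C5 BE decode to a single character under UTF-8 but to two characters under ISO-8859-2, so the caller has to decide which charset applies before any text exists.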
