What charset to use when reading in a Java source file?


Question

I was reading this post:

Should source code be saved in UTF-8 format

and I am using the Eclipse compiler library, but I need to read some Java source files in to feed them to that library. It seems from that post that source files can be stored in different encodings.

Is there one Charset I can use to read a file in so that it works every time? Charset.forName("UTF-8"), maybe?

Thanks,
Dean

Recommended answer

Character encodings vary

Any tool can write Java source code in any encoding. Even the idea of a .java file is not defined by the Java Language Specification. Any IDE can persist Java source code any way it wants, in any encoding.

The tools are responsible for ultimately providing a Unicode-compliant stream of characters to the compiler toolchain. How they collect and persist the source code is up to the particular tools.

The Java Language Specification states in Chapter 3, Lexical Structure:

Programs are written using the Unicode character set. Information about this character set and its associated character encodings may be found at http://www.unicode.org/

So presumably a Java source code file would use one of the character encodings commonly used with Unicode, such as UTF-8, UTF-16, or UCS-2.

Section 3.2, Lexical Translations, mentions that a Java program may be written in an encoding such as ASCII with Unicode escapes embedded in it:

A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx.
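As a rough illustration of what expanding such escapes involves, here is a minimal sketch. The class name UnicodeEscapes and the method expand are hypothetical, not part of the JDK or the Eclipse compiler. It assumes the rule from the specification that a backslash can begin an escape only when it is preceded by an even number of contiguous backslashes, that one or more u characters may follow it, and that exactly four hexadecimal digits must come next. A real compiler front end does this for you, so treat it only as a sketch for code that pre-processes source text itself.

```java
final class UnicodeEscapes {
    /** Expands backslash-u escapes in raw source text (hypothetical helper). */
    static String expand(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int backslashRun = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                backslashRun = 0;
                i++;
                continue;
            }
            boolean eligible = backslashRun % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // any number of 'u' characters is allowed
                }
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("Truncated Unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int d = hexValue(raw.charAt(k));
                    if (d < 0) {
                        throw new IllegalArgumentException("Malformed Unicode escape at index " + i);
                    }
                    value = value * 16 + d;
                }
                out.append((char) value); // one UTF-16 code unit; not re-scanned
                backslashRun = 0;
                i = j + 4;
            } else {
                out.append(c); // ordinary backslash, copied through unchanged
                backslashRun++;
                i++;
            }
        }
        return out.toString();
    }

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

The sketch rejects malformed escapes with an exception, mirroring the fact that the compiler treats them as errors; a more lenient tool might prefer to copy them through untouched.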

While UTF-8 is common in my experience, it is not the only possible encoding. You must know or guess the encoding of any particular source file, and you must account for expanding any Unicode escapes.
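One way to "guess" is to try UTF-8 strictly and fall back to another charset only when the bytes are not valid UTF-8. The sketch below does that with the standard java.nio.charset API. The class SourceDecoder, its methods, and the choice of fallback charset are assumptions made for illustration; no fallback can be correct for every file, so this is a heuristic rather than a guarantee.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class SourceDecoder {
    /** Decodes as strict UTF-8; if the bytes are malformed, uses the fallback charset. */
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the caller-supplied fallback encoding.
            return new String(bytes, fallback);
        }
    }

    /** Reads a whole file and decodes it with the heuristic above. */
    static String read(Path file, Charset fallback) throws IOException {
        return decode(Files.readAllBytes(file), fallback);
    }
}
```

For example, SourceDecoder.read(path, StandardCharsets.ISO_8859_1) would return correctly decoded text for UTF-8 files and treat anything else as Latin-1. If the project records its encoding somewhere (build configuration, IDE settings), using that value directly is more reliable than guessing.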

By the way, note that at least in the Oracle JDK, the byte order mark (BOM) that is optional in UTF-8 files is not allowed in Java source because of a bug (JDK-4508058) that will never be fixed, owing to backward-compatibility concerns.
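If you read the bytes yourself before handing text to a compiler library, you can skip a leading UTF-8 BOM (the bytes EF BB BF) before decoding. The following is a minimal sketch; the class Utf8Bom and its methods are hypothetical names, and it assumes the rest of the file is UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bom {
    /** True if the array starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasBom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** Decodes UTF-8 bytes, dropping a leading BOM if one is present. */
    static String decodeSkippingBom(byte[] bytes) {
        int offset = hasBom(bytes) ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }
}
```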

Also note that line terminators may vary: the ASCII characters CR (CARRIAGE RETURN), or LF (LINE FEED), or CR LF.
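If your own code needs to count or split lines (for error positions, say), it may help to normalize all three terminators to LF first. This is a small sketch with a hypothetical class name, LineTerminators; whether normalization is appropriate depends on what the consuming library expects.

```java
final class LineTerminators {
    /** Converts CR LF and lone CR to LF. CR LF is handled first so it yields a single LF. */
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Splits text into lines, keeping trailing empty lines. */
    static String[] lines(String text) {
        return normalize(text).split("\n", -1);
    }
}
```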

White space varies: SPACE (SP), CHARACTER TABULATION (HT, horizontal tab), FORM FEED (FF), and line terminators.

Read the spec for additional details. For example, regarding the SUBSTITUTE character:

As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
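A tool that processes source text on its own could mirror this rule by dropping a single trailing SUB character. The sketch below uses a hypothetical class, TrailingSub, and assumes the text has already been decoded and had its Unicode escapes expanded, since the rule applies to the escaped input stream.

```java
final class TrailingSub {
    private static final char SUB = (char) 0x1A; // ASCII SUBSTITUTE, control-Z

    /** Removes SUB only when it is the very last character of the text. */
    static String stripTrailingSub(String text) {
        int last = text.length() - 1;
        return (last >= 0 && text.charAt(last) == SUB) ? text.substring(0, last) : text;
    }
}
```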



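About character encodings

Be sure you understand the basics of Unicode and character encodings. A good place to start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.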

Even supposed rules such as "one public class per .java file" may be defined by particular tools rather than by Java itself. The CodeWarrior tools for Java, way back when, supported multiple classes per file.
