Unicode in javadoc and comments?


Question


Some compilers failed on non-ASCII characters in JavaDoc and source code comments. What are the current (Java 7) and future (Java 8 and beyond) practices with respect to Unicode in Java source files? Are there differences between IcedTea, OpenJDK, and other Java environments, and what is dictated by the language specification? Should all non-ASCII characters be escaped in JavaDoc with HTML &escape;-like codes? But what would be the Java // comment equivalent?

Update: comments indicate that one can use any character set, and that upon compiling one needs to indicate what char set is used in the source file. I will look into this, and will be looking for details on how to configure this via Ant, Eclipse, and Maven.

Solution

Some compilers failed on non-ASCII characters in JavaDoc and source code comments.

This is likely because the compiler assumes that the input is UTF-8, and there are invalid UTF-8 sequences in the source file. That these appear to be in comments in your source code editor is irrelevant because the lexer (which distinguishes comments from other tokens) never gets to run. The failure occurs while the tool is trying to convert bytes into chars before the lexer runs.
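
To make that failure mode concrete, here is a minimal sketch of a strict byte-to-char decode. It is not javac's actual code; the class name StrictDecodeDemo and the helper decodesCleanly are made up for illustration. It only shows that a Latin-1 encoded comment is not valid UTF-8, so decoding can fail before any lexer sees the text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    /** Returns true if the bytes decode without error in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A comment containing e-acute, saved as ISO-8859-1: the accented
        // letter is the single byte 0xE9, which is not valid UTF-8 here.
        byte[] latin1Comment =
                "// caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodesCleanly(latin1Comment, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(latin1Comment, StandardCharsets.ISO_8859_1));
    }
}
```

The same bytes are fine once the decoder is told the encoding they were written in, which is what the -encoding flag does for the command line tools.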


The man pages for javac and javadoc say

-encoding name
          Specifies  the  source  file  encoding   name,   such   as
          EUCJIS/SJIS.   If  this option is not specified, the plat-
          form default converter is used.

so running javadoc with the encoding flag

javadoc -encoding <encoding-name> ...

after replacing <encoding-name> with the encoding you've used for your source files should cause it to use the right encoding.

If you've got more than one encoding used within a group of source files that you need to compile together, you need to fix that first and settle on a single uniform encoding for all source files. You should really just use UTF-8 or stick to ASCII.


What are the current (Java 7) and future (Java 8 and beyond) practices with respect to Unicode in Java source files?

The algorithm for dealing with a source file in Java is

  1. Collect bytes
  2. Convert bytes to chars (UTF-16 code units) using some encoding.
  3. Replace all sequences of '\\' 'u' followed by four hex digits with the code-unit corresponding to those hex-digits. Error out if there is a "\u" not followed by four hex digits.
  4. Lex the chars into tokens.
  5. Parse the tokens into classes.

The current and former practice is that step 2, converting bytes to UTF-16 code units, is up to the tool that is loading the compilation unit (source file) but the de facto standard for command line interfaces is to use the -encoding flag.

After that conversion happens, the language mandates that \uABCD style sequences are converted to UTF-16 code units (step 3) before lexing and parsing.

For example:

int a;
\u0061 = 42;

is a valid pair of Java statements. Any Java source code tool must, after converting bytes to chars but before lexing, look for \uABCD sequences and convert them, so this code is converted to

int a;
a = 42;

before parsing. This happens regardless of where the \uABCD sequence occurs.

This process looks something like

  1. Get bytes: [105, 110, 116, 32, 97, 59, 10, 92, 117, 48, 48, 54, 49, 32, 61, 32, 52, 50, 59]
  2. Convert bytes to chars: ['i', 'n', 't', ' ', 'a', ';', '\n', '\\', 'u', '0', '0', '6', '1', ' ', '=', ' ', '4', '2', ';']
  3. Replace unicode escapes: ['i', 'n', 't', ' ', 'a', ';', '\n', 'a', ' ', '=', ' ', '4', '2', ';']
  4. Lex: ["int", "a", ";", "a", "=", "42", ";"]
  5. Parse: (Block (Variable (Type int) (Identifier "a")) (Assign (Reference "a") (Int 42)))
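
Steps 1 to 3 of that walk-through can be sketched in a few lines of Java. This is a simplified illustration, not what javac does internally: the class UnicodeEscapeDemo and the method replaceUnicodeEscapes are hypothetical names, and the sketch ignores two details of the language specification (an escape may contain several u characters, and a backslash preceded by an odd number of backslashes does not start an escape).

```java
import java.nio.charset.StandardCharsets;

public class UnicodeEscapeDemo {
    /** Step 3: replace each backslash, 'u', four-hex-digit run with its code unit. */
    static String replaceUnicodeEscapes(String chars) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < chars.length()) {
            char c = chars.charAt(i);
            boolean escapeStart = c == '\\'
                    && i + 1 < chars.length()
                    && chars.charAt(i + 1) == 'u';
            if (!escapeStart) {
                out.append(c);
                i++;
                continue;
            }
            if (i + 6 > chars.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int codeUnit = 0;
            for (int k = i + 2; k < i + 6; k++) {
                int digit = Character.digit(chars.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                codeUnit = codeUnit * 16 + digit;
            }
            out.append((char) codeUnit);
            i += 6;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Step 1: the raw bytes from the walk-through above.
        byte[] bytes = {105, 110, 116, 32, 97, 59, 10, 92, 117, 48, 48, 54, 49,
                        32, 61, 32, 52, 50, 59};
        // Step 2: bytes to chars, using an explicitly chosen encoding.
        String chars = new String(bytes, StandardCharsets.US_ASCII);
        // Step 3: escapes to code units; the result is ready for the lexer.
        System.out.println(replaceUnicodeEscapes(chars));
    }
}
```

Because the replacement works on plain chars and knows nothing about tokens, it treats an escape inside a comment exactly like one inside an identifier.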


Should all non-ASCII characters be escaped in JavaDoc with HTML &escape;-like codes?

No need, except for HTML special characters like '<' that you want to appear literally in the documentation. You can use \uABCD sequences inside javadoc comments. Java processes \u.... escapes before lexing the source file, so they can appear inside strings, comments, anywhere really. That's why

System.out.println("Hello, world!\u0022);

is a valid Java statement.

/** @return \u03b8 in radians */

is equivalent to

/** @return θ in radians */

as far as javadoc is concerned.
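
Both points can be checked with a small class that compiles as written. The class and method names are invented for this sketch; the only assumption is a standard javac.

```java
public class EscapeInLiteralDemo {
    /** @return \u03b8 in radians */
    static double theta() {
        return Math.PI / 2;
    }

    static String greeting() {
        // The escape at the end of the next line becomes a double quote
        // before lexing, so it is what closes the string literal.
        return "Hello, world!\u0022;
    }

    static String thetaSymbol() {
        // One UTF-16 code unit, U+03B8, the Greek small letter theta.
        return "\u03b8";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
        System.out.println(thetaSymbol().length());
        System.out.println(Integer.toHexString(thetaSymbol().codePointAt(0)));
    }
}
```

An all-ASCII source file like this one compiles the same way under any ASCII-compatible -encoding setting, which is the main practical reason to prefer escapes when you cannot control how a file will be decoded.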


But what would be the Java // comment equivalent?

You can use // comments in Java, but Javadoc only looks inside /**...*/ comments for documentation. // comments do not carry metadata.

One ramification of Java's handling of \uABCD sequences is that although

// Comment text.\u000A System.out.println("Not really comment text");

looks like a single line comment, and many IDEs will highlight it as such, it is not.
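
A runnable version of the same trap, as a small sketch (the class name and the counter are made up for illustration): the escape ends the comment, so the rest of the line is compiled and executed as an ordinary statement.

```java
public class CommentEscapeDemo {
    static int calls = 0;

    static void run() {
        // This looks like a comment only. \u000A calls++;
    }

    public static void main(String[] args) {
        run();
        // The increment hidden after the escape really ran.
        System.out.println(calls);
    }
}
```

This is also why an invalid escape inside a comment, such as a backslash followed by u and something other than four hex digits, is a compile error even though it sits in text the compiler would otherwise ignore.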
