Understanding character encoding in typical Java web app


Problem Description



Some pseudocode:

String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response

When you save the Java String (UTF-16) to Oracle VARCHAR(15) does Oracle also store this as UTF-16? Does the length of an Oracle VARCHAR refer to number of Unicode characters (and not number of bytes)?

When we write b to the ServletResponse is this being written as UTF-16 or are we by default converting to another encoding like UTF-8?

Solution

The ability of Oracle to store (and later retrieve) Unicode text from the database relies only on the character set of the database (usually specified during database creation). Choosing AL32UTF8 as the character set is recommended for storage of Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2), for it will enable you to access all of the Unicode codepoints while not consuming a lot of storage space compared to other encodings like AL16UTF16/AL32UTF32.
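To make the "all Unicode codepoints" point concrete, here is a small stand-alone check (plain JDK, no Oracle involved) showing how a single supplementary character - one Unicode code point outside the Basic Multilingual Plane - occupies two UTF-16 code units in a Java String and four bytes in UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class CodePointSizes {
    public static void main(String[] args) {
        // U+1F600 (a supplementary code point) as a surrogate pair
        String s = "\uD83D\uDE00";

        System.out.println(s.length());                                // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));           // 1 Unicode character
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
    }
}
```

This is why "length" is ambiguous in any discussion of Unicode storage: one character, two Java chars, four bytes.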

Assuming this is done, it is the Oracle JDBC driver that is responsible for conversion of UTF-16 encoded data into AL32UTF8. This "automatic" conversion between encodings also happens when data is read from the database. To answer the query on byte length of VARCHAR, the definition of a VARCHAR2 column in Oracle involves byte semantics - VARCHAR2(n) is used to define a column that can store n bytes (this is the default behavior, as specified by the NLS_LENGTH_SEMANTICS parameter of the database); if you need to define the size based on characters, VARCHAR2(n CHAR) is to be used.
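To see how byte semantics bite in practice, here is a minimal sketch (no database required; the string and sizes are illustrative): fifteen accented Latin characters need thirty bytes in UTF-8/AL32UTF8, so they would overflow a VARCHAR2(15) column under the default byte semantics even though they fit VARCHAR2(15 CHAR):

```java
import java.nio.charset.StandardCharsets;

public class VarcharSemantics {
    public static void main(String[] args) {
        // 15 characters, each encoded as 2 bytes in UTF-8
        String text = "é".repeat(15);

        int chars = text.length();
        int bytes = text.getBytes(StandardCharsets.UTF_8).length;

        System.out.println(chars); // 15 -> would fit VARCHAR2(15 CHAR)
        System.out.println(bytes); // 30 -> overflows VARCHAR2(15) under byte semantics
    }
}
```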

The encoding of the data written to the ServletResponse object depends on the default character encoding, unless this is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls. All in all, for a complete Unicode solution involving an Oracle database, one must have knowledge of

  1. The encoding of the incoming data (i.e. the encoding of the data read via the ServletRequest object). This can be done via specifying the accepted encoding in the HTML forms via the accept-charset attribute. If the encoding is unknown, the application could attempt to set it to a known value via the ServletRequest.setCharacterEncoding() method. This method doesn't change the existing encoding of characters in the stream. If the input stream is in ISO-Latin1, specifying a different encoding will most likely result in an exception being thrown. Knowing the encoding is important, since the Java runtime libraries will require knowledge of the original encoding of the stream, if the contents of the stream are to be treated as character primitives or Strings. Apparently, this is required when you invoke ServletRequest.getParameter or similar methods that will process the stream and return String objects. The decoding process will result in creation of characters in the platform encoding (this is UTF-16).
  2. The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from streams cannot be changed. There is, however, a decoding process that will convert characters in supported encodings to UTF-16 characters, whenever such data is accessed as a character primitive or as a String. New String objects, on the other hand, can be created with a defined encoding. This matters when you write the contents of the stream out onto another stream (the HttpServletResponse object's output stream, for instance). If the contents of the input stream are being treated as a sequence of bytes, and not as characters or Strings, then no decoding operation will be undertaken by the JVM. This would imply that the bytes written to the output stream must not be altered if intermediate character or String objects are not created. Otherwise, it is quite possible that the contents of the output stream will be malformed and parsed incorrectly by a corresponding decoder. In simpler words,

    • if one is writing String objects or characters to the servlet's output stream, then one must specify the encoding that the browser must use to decode the response. Appropriate encoders might be used to encode the sequence of characters as specified in the desired response.
    • if one is writing a sequence of bytes that will be interpreted as characters, then the encoding to be specified in the HTTP header must be known beforehand
    • if one is writing a sequence of bytes that will be parsed as a sequence of bytes (for images and other binary data), then the concept of encoding is immaterial.
  3. The database character set of the Oracle instance. As indicated previously, data will be stored in the Oracle database, in the defined character set (for CHAR datatypes). The Oracle JDBC driver takes care of conversion of data between UTF-16 and AL32UTF8 (the database character set in this case) for CHAR and NCHAR datatypes. When you invoke resultSet.getString(), a String with UTF-16 characters is being returned by the JDBC driver. The converse is true, when you send data to the database too. If another database character set is used, an additional level of conversion (from the UTF-16 to UTF-8 to the database character set) is performed transparently by the JDBC driver.
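The encoding concerns in the list above can be sketched without a servlet container, using OutputStreamWriter as a stand-in for the response's output stream (the charset choices here are illustrative): the bytes actually sent depend entirely on the encoder chosen, and a browser decoding them with a different charset will mangle the text.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingSketch {
    public static void main(String[] args) throws Exception {
        String b = "Grüße"; // held as UTF-16 inside the JVM, as all Strings are

        // Encode explicitly, as response.setCharacterEncoding("UTF-8") would arrange:
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(b);
        }
        byte[] sent = out.toByteArray();

        // The browser decodes with the charset declared in the HTTP header:
        String correct = new String(sent, StandardCharsets.UTF_8);      // "Grüße"
        String mangled = new String(sent, StandardCharsets.ISO_8859_1); // mojibake

        System.out.println(correct);
        System.out.println(mangled);
    }
}
```

The mismatch in the last line is exactly the failure mode described above: the bytes were never altered, only interpreted under the wrong encoding.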
