Java: How to verify if Thai characters are encoded correctly from UTF-8 to TIS620

Problem description

I get an input string in UTF-8, apply TIS620 encoding, and create a new string from the result. How do I retain the bytes? UTF-8 represents a Thai character in 3 bytes, whereas TIS620 uses 1 byte. I have a requirement where the backend system stores each character of the string as 1 byte only, so the default UTF-8 encoding breaks it.

  1. How to convert String character encoding from UTF-8 to TIS620?
  2. How to retain the byte size while passing it to backend system?
  3. If the string is reassigned to a new String, is the character encoding retained, or is it converted back to UTF-16 (the Java default)?
  4. Is this possible in Java? Is there any library or utility that can be integrated?

I've tried the code below and can confirm that after TIS620 encoding the byte count matches the character count, i.e. 1 byte per char. But if encodedString is assigned to a new String, will it lose the TIS620 format?

(See also: Convert String with encoding UTF-8 to TIS620 (Thai encoding) in Java. What are the ways to do it, and is there any data loss?)

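import java.io.UnsupportedEncodingException;

public String encode() {
    try {
        String input = "ใบใบใบใบ";
        // Encode the String into TIS620 bytes: one byte per Thai character
        byte[] encodedBytes = input.getBytes("TIS620");
        // Decode the TIS620 bytes back into a String
        String encodedString = new String(encodedBytes, "TIS620");
        return encodedString;
    } catch (UnsupportedEncodingException e) {
        // Encoding failed
        return null;
    }
}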

The expected result: if I convert 5 Thai characters from UTF-8 to TIS620, the byte count should drop from 15 (UTF-8) to 5 (TIS620) and stay that way?

Solution

A String in Java is always encoded in UTF-16, no matter how it was constructed. Or put differently: as soon as you have a String object, you should not care about which encoding it has. The encoding only comes back into the picture once you want to go back towards a byte[] (or OutputStream or the like).

This is correct and almost certainly exactly what you want to do. You should not try to work around that fact.
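
To make this concrete, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the JDK in use provides the TIS-620 charset. It decodes the same Thai text once from UTF-8 bytes and once from TIS-620 bytes, then compares the resulting Strings. The sample text from the question is written with Unicode escapes so the encoding of the source file itself does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameStringDemo {
    // Assumption: this JDK supports TIS-620; forName throws otherwise.
    static final Charset TIS620 = Charset.forName("TIS-620");

    // The sample text from the question, spelled with Unicode escapes.
    static final String SAMPLE =
            "\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A";

    /** Decodes text from both encodings and reports whether the Strings are equal. */
    static boolean decodesToSameString(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] tisBytes = text.getBytes(TIS620);
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String fromTis = new String(tisBytes, TIS620);
        return fromUtf8.equals(fromTis);
    }

    public static void main(String[] args) {
        System.out.println("equal Strings: " + decodesToSameString(SAMPLE));
    }
}
```

The two byte arrays differ in length, but once decoded there is nothing left in either String that records where it came from.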

If you need to write the string to disk or send it to some other system in some specific encoding then you can get that encoded data from the String by using getBytes() as you did in your sample code.
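
As a rough sketch of that step (the helper names `isLossless` and `toTis620` are hypothetical, and TIS-620 support in the JDK is assumed), the following compares byte counts and also guards against silent data loss. `String.getBytes` replaces characters the charset cannot represent with a substitute byte instead of failing, so a byte count on its own does not prove the text survived; `CharsetEncoder.canEncode` can be checked first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Tis620EncodeDemo {
    // Assumption: this JDK supports TIS-620; forName throws otherwise.
    static final Charset TIS620 = Charset.forName("TIS-620");

    /** True if every character of text has a TIS-620 mapping. */
    static boolean isLossless(String text) {
        return TIS620.newEncoder().canEncode(text);
    }

    /** Encodes text to TIS-620, refusing input that would be silently altered. */
    static byte[] toTis620(String text) {
        if (!isLossless(text)) {
            throw new IllegalArgumentException("Text contains characters outside TIS-620");
        }
        return text.getBytes(TIS620);
    }

    public static void main(String[] args) {
        // The sample text from the question, spelled with Unicode escapes.
        String input = "\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A";
        System.out.println("chars:         " + input.length());
        System.out.println("UTF-8 bytes:   " + input.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("TIS-620 bytes: " + toTis620(input).length);
    }
}
```

Note that the sample string contains 8 characters, so it occupies 24 bytes in UTF-8 and 8 bytes in TIS-620. The byte[] returned by `toTis620` is what should be handed to the backend, not a String rebuilt from it.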

In other words:

  1. A String object in Java cannot "have TIS620" encoding. A byte[] can contain TIS620-encoded data, and you create that from a String using .getBytes("TIS620").
  2. If you pass the encoded byte[] to the other system, it will have the correct byte size, simply because it was created with the correct encoding.
  3. String always uses UTF-16. Creating a String with the content "ใบใบใบใบ" from UTF-8 data and from TIS620 data will produce exactly identical String objects; there's no way to know which encoding was used to create them.
  4. InputStreamReader, OutputStreamWriter and comparable classes can also be passed an encoding to decode/encode with that encoding respectively. Other than that, no special handling is required.
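
For the stream-based route in point 4, a minimal sketch might look like this. The class and method names are hypothetical, in-memory streams stand in for a real file or socket, and TIS-620 support in the JDK is assumed. The charset is given once when the writer or reader is constructed; everything else is ordinary String handling.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class Tis620StreamDemo {
    // Assumption: this JDK supports TIS-620; forName throws otherwise.
    static final Charset TIS620 = Charset.forName("TIS-620");

    /** Writes text through an OutputStreamWriter that encodes to TIS-620. */
    static byte[] writeAsTis620(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, TIS620)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads TIS-620 bytes back into a String through an InputStreamReader. */
    static String readFromTis620(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), TIS620)) {
            char[] chunk = new char[256];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample text from the question, spelled with Unicode escapes.
        String input = "\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A";
        byte[] wire = writeAsTis620(input);
        System.out.println("bytes written: " + wire.length);
        System.out.println("round trip ok: " + readFromTis620(wire).equals(input));
    }
}
```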
