How to count String bytes properly?


Problem description

A Java String containing special characters such as ç takes two bytes for each such character, but neither the String length() method nor the length of the byte array returned by getBytes() counts these characters as two bytes.

How can I correctly count the number of bytes in a String?

Example:

The word endereço should return a length of 9 instead of 8.

Recommended answer

"The word endereço should return me length 9 instead of 8."

If you expect the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), to have a size of 9 bytes, I suppose you want the UTF-8 charset, which uses 1 byte for characters in the ASCII table and more for the others.

"but String length method or getting the length of it with the byte array returned from getBytes method doesn't return special chars counted as two bytes."


The String length() method doesn't answer the question "how many bytes are used?". It answers "how many UTF-16 code units, or more simply chars, does the String contain?"

String length() Javadoc:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.
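To make the difference between code units, code points and bytes concrete, here is a minimal sketch. The class and method names (LengthVsBytes, codeUnits, codePoints, utf8Bytes) are illustrative and not part of the original answer; the word is written with a \u00e7 escape so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class LengthVsBytes {

    // What String.length() reports: the number of UTF-16 code units (chars).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; differs from length() for surrogate pairs.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes once the String is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(codeUnits(word));  // 8
        System.out.println(codePoints(word)); // 8
        System.out.println(utf8Bytes(word));  // 9
    }
}
```

None of the three numbers is "the" size of a String: the byte count only exists relative to a chosen charset.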


The byte[] getBytes() method with no argument encodes the String into a byte array. You can use the length of the returned array to find out how many bytes the encoded String uses, but the result depends on the charset used for the encoding. The no-argument getBytes() method does not let you specify the charset: it uses the platform's default charset.
So it may not give the expected result if the default charset is not the one you want to use to encode your Strings into bytes.
Besides, the way the String is encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not at all.

byte[] getBytes() Javadoc:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
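As an illustration of the CharsetEncoder route mentioned in the Javadoc, the following sketch counts bytes strictly: it reports a failure instead of silently substituting characters that the charset cannot represent. The class name StrictByteCount, the method strictLength and the choice of returning an empty OptionalInt on failure are assumptions made for this example, not something the original answer prescribes.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.OptionalInt;

// Hypothetical helper class, for illustration only.
final class StrictByteCount {

    // Returns the encoded size in bytes, or an empty OptionalInt when the
    // String contains something the charset cannot encode.
    static OptionalInt strictLength(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // encode() returns a buffer whose remaining bytes are the encoded data.
            return OptionalInt.of(encoder.encode(CharBuffer.wrap(s)).remaining());
        } catch (CharacterCodingException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(strictLength(word, StandardCharsets.UTF_8));      // OptionalInt[9]
        System.out.println(strictLength(word, StandardCharsets.ISO_8859_1)); // OptionalInt[8]
        System.out.println(strictLength(word, StandardCharsets.US_ASCII));   // OptionalInt.empty
    }
}
```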

In your example String "endereço", if getBytes() returns an array of size 8 rather than 9, it means the default charset is not UTF-8 but a charset that uses a fixed width of 1 byte per character, such as ISO 8859-1 or one of its derived charsets such as windows-1252 on Windows-based systems.
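The 8 versus 9 difference can be reproduced on any machine by naming the charsets explicitly. This is a small sketch; FixedWidthVsUtf8 and byteCount are illustrative names, and ISO_8859_1 stands in for windows-1252 because it is one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class FixedWidthVsUtf8 {

    // Size in bytes of the String once encoded with the given charset.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String cedilla = "\u00e7";     // "ç"
        String word = "endere\u00e7o"; // "endereço"

        // ISO 8859-1 always uses one byte per character it can represent.
        System.out.println(byteCount(cedilla, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteCount(word, StandardCharsets.ISO_8859_1));    // 8

        // UTF-8 uses one byte for ASCII and two bytes for ç.
        System.out.println(byteCount(cedilla, StandardCharsets.UTF_8)); // 2
        System.out.println(byteCount(word, StandardCharsets.UTF_8));    // 9
    }
}
```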

To find out the default charset of the Java virtual machine the application runs on, you can use this utility method: Charset defaultCharset = Charset.defaultCharset(). Note that since JDK 18 the default charset is UTF-8 on all platforms unless it is overridden, for example with the file.encoding system property.

Solution

The byte[] getBytes() method comes with two other very useful overloads:

  • byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException

  • byte[] java.lang.String.getBytes(Charset charset)

Unlike the getBytes() method with no argument, these methods let you specify the charset to use for the byte encoding.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc:

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc:

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
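The replacement behaviour matters when counting bytes, because an unmappable character still contributes bytes to the result. The sketch below encodes the example word as US-ASCII, whose replacement is the single byte '?'; ReplacementDemo and toAscii are illustrative names, not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ReplacementDemo {

    // Encodes with US-ASCII; characters outside ASCII are replaced, not rejected.
    static byte[] toAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = toAscii("endere\u00e7o"); // "endereço"

        // Still 8 bytes: the ç was replaced by one '?' byte.
        System.out.println(bytes.length); // 8

        // Decoding shows the information loss.
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // endere?o
    }
}
```

So a plausible-looking byte count does not prove that the charset could represent the whole String.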

You can use either one (there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.

For example, to get a UTF-8 encoded byte array using getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

And you will get a length of 9 bytes, as you wish.
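One of the differences between the two overloads is the checked exception. A possible way to wrap the count in a reusable method is sketched below, using the Charset overload so that no UnsupportedEncodingException has to be declared or caught. Utf8Size and utf8Length are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Size {

    // Number of bytes the String occupies when encoded as UTF-8.
    // getBytes(Charset) declares no checked exception, unlike getBytes(String).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("endere\u00e7o")); // 9
        System.out.println(utf8Length("endereco"));      // 8
        System.out.println(utf8Length(""));              // 0
    }
}
```

This allocates a temporary byte array for every call, which is usually fine for short Strings.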

Here is a more comprehensive example that displays the default charset, then the size of the byte encoding with the platform's default charset, with UTF-8 and with UTF-16:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// The class name is illustrative; only the main method comes from the answer.
public class StringBytesDemo {

    public static void main(String[] args) throws UnsupportedEncodingException {

        // default charset
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with default platform encoding
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}

Output on a Windows-based machine:

default charset = windows-1252
getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
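The UTF-16 size of 18 may look surprising for 8 characters of 2 bytes each. When encoding, Java's UTF-16 charset writes a 2-byte big-endian byte-order mark before the data, which accounts for the extra bytes. The sketch below compares it with UTF_16BE and UTF_16LE, which write no byte-order mark; Utf16Bom and its methods are illustrative names only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf16Bom {

    // Size in bytes of the String once encoded with the given charset.
    static int size(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // True when the encoded bytes start with the big-endian BOM FE FF.
    static boolean startsWithBigEndianBom(byte[] bytes) {
        return bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(size(word, StandardCharsets.UTF_16));   // 18 = 2 (BOM) + 8 * 2
        System.out.println(size(word, StandardCharsets.UTF_16BE)); // 16
        System.out.println(size(word, StandardCharsets.UTF_16LE)); // 16
        System.out.println(startsWithBigEndianBom(word.getBytes(StandardCharsets.UTF_16))); // true
    }
}
```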
