How to establish the codepoint of encoded characters?


Question

Given a stream of bytes (that represent characters) and the encoding of the stream, how would I obtain the code points of the characters?

InputStreamReader r = new InputStreamReader(bla, Charset.forName("UTF-8"));
int whatIsThis = r.read(); 

What does read() return in the above snippet? Is it the Unicode code point?

Recommended answer

read() returns a single char widened to an int (0 to 65535, or -1 at end of stream), which is not necessarily a whole code point. A char is (implicitly) a 16-bit UTF-16 code unit. This encoding can represent basic multilingual plane characters with a single char. Characters in the supplementary range are represented as two-char sequences (surrogate pairs).
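To make the distinction concrete, here is a minimal sketch that decodes a short UTF-8 byte array and collects every value read() hands back. The class name ReadReturnsCodeUnits and the helper readAll are made up for this illustration; the sample bytes encode U+0041 followed by the supplementary character U+1D11E.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
public class ReadReturnsCodeUnits {

  // Decodes the bytes as UTF-8 and collects every value returned by read().
  public static List<Integer> readAll(byte[] utf8) {
    List<Integer> units = new ArrayList<>();
    try (Reader r = new InputStreamReader(
        new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
      for (int c = r.read(); c != -1; c = r.read()) {
        units.add(c);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return units;
  }

  public static void main(String[] args) {
    // 'A' (U+0041) followed by U+1D11E, both encoded as UTF-8.
    byte[] bytes = {0x41, (byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E};
    for (int unit : readAll(bytes)) {
      System.out.printf("0x%04X%n", unit);
    }
  }
}
```

The four UTF-8 bytes of U+1D11E come back as two separate read() results, 0xD834 and 0xDD1E, so the caller has to combine them to recover the code point.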

The Character type contains methods for translating UTF-16 code units to Unicode code points.

A code point that requires two chars is stored as a surrogate pair: taking two sequential values from a sequence, the first satisfies isHighSurrogate and the second satisfies isLowSurrogate. The codePointAt methods can be used to extract code points from code unit sequences. There are similar methods (such as toChars) for going from code points to UTF-16 code units.
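As a rough illustration of those Character methods, the sketch below walks a String by code point and converts a code point back to chars. The class SurrogateDemo and the helper codePointsOf are names invented for this example; the Character calls themselves are standard JDK methods.

```java
// Hypothetical demo class, not part of the original answer.
public class SurrogateDemo {

  // Extracts code points from a UTF-16 sequence using Character.codePointAt.
  public static int[] codePointsOf(String s) {
    int[] result = new int[s.codePointCount(0, s.length())];
    int n = 0;
    for (int i = 0; i < s.length(); ) {
      int cp = Character.codePointAt(s, i);
      result[n++] = cp;
      // charCount is 2 for supplementary code points, 1 otherwise.
      i += Character.charCount(cp);
    }
    return result;
  }

  public static void main(String[] args) {
    String s = "A\uD834\uDD1E"; // 'A' followed by U+1D11E as a surrogate pair
    char high = s.charAt(1);
    char low = s.charAt(2);

    System.out.println(Character.isHighSurrogate(high));
    System.out.println(Character.isLowSurrogate(low));
    System.out.printf("U+%04X%n", Character.toCodePoint(high, low));

    for (int cp : codePointsOf(s)) {
      System.out.printf("U+%04X%n", cp);
    }

    // The reverse direction: code point to UTF-16 code units.
    char[] units = Character.toChars(0x1D11E);
    System.out.println(units.length);
  }
}
```

Stepping by Character.charCount keeps the index aligned with code point boundaries, which is the same job the stream reader further down does one read() at a time.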

A sample implementation of a code point stream reader:

import java.io.*;

// Wraps a Reader and returns whole code points instead of UTF-16 code units.
public class CodePointReader implements Closeable {
  private final Reader charSource;
  // The next code unit to be consumed, read one step ahead; -1 at end of stream.
  private int codeUnit;

  public CodePointReader(Reader charSource) throws IOException {
    this.charSource = charSource;
    codeUnit = charSource.read();
  }

  public boolean hasNext() { return codeUnit != -1; }

  // Call only while hasNext() is true.
  public int nextCodePoint() throws IOException {
    try {
      char high = (char) codeUnit;
      if (Character.isHighSurrogate(high)) {
        // A high surrogate must be followed by a low surrogate.
        int next = charSource.read();
        if (next == -1) { throw new IOException("malformed character"); }
        char low = (char) next;
        if (!Character.isLowSurrogate(low)) {
          throw new IOException("malformed sequence");
        }
        return Character.toCodePoint(high, low);
      } else {
        // A single code unit is its own code point.
        return codeUnit;
      }
    } finally {
      // Read ahead so hasNext() reflects the state after this call.
      codeUnit = charSource.read();
    }
  }

  public void close() throws IOException { charSource.close(); }
}
