The difference between InputStream and InputStreamReader when reading multi-byte characters

Question

The difference between InputStream and InputStreamReader is that InputStream reads bytes, while InputStreamReader reads chars. For example, if the text in a file is abc, then both of them work fine. But if the text is a你们, which is composed of an a and two Chinese characters, then InputStream does not read the characters correctly.

So we should use InputStreamReader, but my question is:

How does InputStreamReader recognize characters?

a is one byte, but a Chinese character is two bytes. Does it read a as one byte and recognize the other characters as two bytes each, or does InputStreamReader read every character in this text as two bytes?

Answer

An InputStream reads raw octet (8-bit) data. In Java, the byte type is roughly equivalent to the char type in C (more precisely signed char, since a Java byte is always signed). In C, this type can be used to represent character data or binary data. In Java, the char type, which holds a 16-bit UTF-16 code unit, has more in common with the C wchar_t type.

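An InputStreamReader then decodes data from some encoding into UTF-16. If "a你们" is encoded as UTF-8 on disk, it will be the byte sequence 61 E4 BD A0 E4 BB AC: one byte for a and three bytes for each Chinese character. When you pass the InputStream to an InputStreamReader with the UTF-8 encoding, it will be read as the char sequence 0061 4F60 4EEC. So the number of bytes consumed per character is determined by the encoding; it is not fixed at one or two.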
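To make the byte-versus-char distinction concrete, here is a minimal sketch that reads the same seven bytes twice: once directly from an InputStream and once through an InputStreamReader configured for UTF-8. The class name ByteVsCharDemo and its helper methods are hypothetical names chosen for illustration, and a ByteArrayInputStream stands in for the file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteVsCharDemo {
    // The UTF-8 bytes for "a你们", as listed in the answer above.
    static final byte[] UTF8_BYTES = {
        0x61,
        (byte) 0xE4, (byte) 0xBD, (byte) 0xA0,
        (byte) 0xE4, (byte) 0xBB, (byte) 0xAC
    };

    // Each InputStream.read() call returns one raw byte as an int in 0..255.
    static List<String> readBytes(byte[] data) throws IOException {
        List<String> units = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                units.add(String.format("%02X", b));
            }
        }
        return units;
    }

    // Each Reader.read() call returns one decoded UTF-16 code unit (a Java char).
    static List<String> readChars(byte[] data) throws IOException {
        List<String> units = new ArrayList<>();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                units.add(String.format("%04X", c));
            }
        }
        return units;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readBytes(UTF8_BYTES)); // [61, E4, BD, A0, E4, BB, AC]
        System.out.println(readChars(UTF8_BYTES)); // [0061, 4F60, 4EEC]
    }
}
```

The raw stream yields seven units and the reader yields three, which matches the byte and char sequences quoted above.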

The character encoding API in Java (the java.nio.charset package) contains the algorithms to perform this transformation. Oracle publishes a list of the encodings supported by its JRE. The ICU project is a good place to start if you want to understand the internals of how this works in practice.
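As a rough illustration of that API, the sketch below uses Charset and CharsetDecoder from java.nio.charset directly. The class name CharsetApiDemo and the helper decodeStrict are hypothetical; the example only assumes the UTF-8 and UTF-16BE charsets, which every Java platform is required to support.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class CharsetApiDemo {
    // Decodes a whole byte array with a fresh decoder. A decoder obtained from
    // newDecoder() reports malformed input, so bad bytes raise an exception
    // instead of being silently replaced.
    static String decodeStrict(byte[] data, Charset charset) throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder();
        CharBuffer chars = decoder.decode(ByteBuffer.wrap(data));
        return chars.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "a你们" written with Unicode escapes so the source file encoding does not matter.
        String text = "a\u4F60\u4EEC";

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);

        // The same three characters occupy a different number of bytes per encoding.
        System.out.println(utf8.length);  // 7
        System.out.println(utf16.length); // 6

        // Decoding with the matching charset recovers the original text.
        System.out.println(decodeStrict(utf8, StandardCharsets.UTF_8).equals(text));     // true
        System.out.println(decodeStrict(utf16, StandardCharsets.UTF_16BE).equals(text)); // true
    }
}
```

An InputStreamReader performs this kind of decoding incrementally as bytes arrive, rather than on a complete array.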

As Alexander Pogrebnyak points out, you should almost always provide the encoding explicitly. byte-to-char methods that do not specify an encoding rely on the JRE default, which can depend on the operating system and user settings (since Java 18 the default is UTF-8 unless it is overridden).
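The following sketch shows why the explicit encoding matters: the same UTF-8 bytes are decoded once with the correct charset and once with ISO-8859-1, which stands in here for a mismatched platform default. ExplicitCharsetDemo and readAll are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Reads the whole stream as text, decoding with the charset the caller names.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u4F60\u4EEC"; // "a你们"
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Correct charset: three chars, identical to the original text.
        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(right.length()); // 3

        // Wrong charset: ISO-8859-1 maps every byte to one char, giving seven chars of mojibake.
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 7

        // The charset used when none is specified; the value varies by JRE and environment.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing the Charset as a parameter keeps the decoding decision visible at the call site instead of leaving it to the environment.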
