Difference between Text and String in Hadoop


Question

What is the difference between org.apache.hadoop.io.Text and java.lang.String in the Hadoop framework?

Why couldn't they use String instead of introducing a new Text class?

I looked into the difference and found that it has to do with the encoding format, but I don't fully understand it yet.

Can someone explain the differences (with examples, if applicable)?

Recommended answer

The binary representation of a Text object is a variable-length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves.
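The layout described above can be sketched with the Java standard library alone. This is a minimal illustration, not Hadoop's code: the class name LengthPrefixedUtf8 is hypothetical, and the sketch assumes the encoded string is at most 127 bytes long, the range in which the variable-length integer fits in a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical, stdlib-only sketch of a "length, then UTF-8 bytes" layout.
// It does not use Hadoop and is not Hadoop's implementation.
final class LengthPrefixedUtf8 {

    // Encodes s as [length][UTF-8 bytes].
    // ASSUMPTION: only lengths 0..127 are handled, where the
    // variable-length integer occupies exactly one byte.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch handles at most 127 bytes");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(utf8.length);          // length prefix (one byte in this range)
        out.write(utf8, 0, utf8.length); // the UTF-8 bytes themselves
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encode("hadoop");
        System.out.println(encoded.length); // 7: one length byte plus six data bytes
        System.out.println(encoded[0]);     // 6: the number of UTF-8 bytes
    }
}
```

Note that the prefix counts bytes, not characters: a string containing non-ASCII characters has a prefix larger than its String.length().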

Text is a replacement for the UTF8 class, which was deprecated because it didn't support strings whose encoding was over 32,767 bytes, and because it used Java's modified UTF-8.
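To see how Java's modified UTF-8 differs from standard UTF-8, the sketch below compares the two encodings using only the standard library. It relies on DataOutputStream.writeUTF, which is documented to write modified UTF-8. It illustrates the encoding itself, not the deprecated Hadoop UTF8 class, and the class name ModifiedUtf8Demo is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; stdlib only, no Hadoop dependency.
final class ModifiedUtf8Demo {

    // Payload bytes that DataOutputStream.writeUTF (modified UTF-8)
    // produces for s, excluding its 2-byte length prefix.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    // Bytes produced by the standard UTF-8 encoder.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String supplementary = "\uD801\uDC00"; // U+10400: one code point, two chars
        System.out.println(standardUtf8Length(supplementary)); // 4
        System.out.println(modifiedUtf8Length(supplementary)); // 6

        String nul = "\0"; // U+0000
        System.out.println(standardUtf8Length(nul)); // 1
        System.out.println(modifiedUtf8Length(nul)); // 2
    }
}
```

The six-byte form encodes each surrogate separately, so a decoder that expects standard UTF-8 may reject it or misread it.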

Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.

In brief, here are some of the ways Text behaves differently from String:

Indexing: Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string or the Java char code unit (as it is for String).

For instance, charAt() returns an int representing a Unicode code point, unlike the String variant, which returns a char.
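The difference between char indices and byte offsets can be seen without Hadoop. The sketch below uses only java.lang.String and the standard UTF-8 encoder. The class name IndexingDemo and the helper byteOffsetOf are hypothetical, and the sample string is chosen so that its four characters take 1, 2, 3, and 4 bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; stdlib only, no Hadoop dependency.
final class IndexingDemo {

    // Byte offset in the UTF-8 encoding at which the char at charIndex starts.
    // ASSUMPTION: charIndex does not fall in the middle of a surrogate pair.
    static int byteOffsetOf(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041 (1 byte), U+00DF (2 bytes), U+6771 (3 bytes), U+10400 (4 bytes)
        String s = "A\u00DF\u6771\uD801\uDC00";

        System.out.println(s.length());                                // 5 char code units
        System.out.println(s.codePointCount(0, s.length()));           // 4 code points
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 10 UTF-8 bytes

        int charIndex = s.indexOf("\u6771");
        System.out.println(charIndex);                  // 2: index in char units
        System.out.println(byteOffsetOf(s, charIndex)); // 3: offset in encoded bytes
    }
}
```

A byte-indexed API would locate U+6771 at offset 3 in this string, whereas String.indexOf reports 2.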

Iteration: Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index.
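To illustrate why the index cannot simply be incremented, here is a stdlib-only sketch that walks a UTF-8 byte array one code point at a time. The step size comes from the lead byte of each sequence. The class Utf8Iteration and its methods are hypothetical, and the sketch assumes well-formed UTF-8 and performs no validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; stdlib only, no Hadoop dependency.
final class Utf8Iteration {

    // Length in bytes of the UTF-8 sequence that starts with leadByte.
    // ASSUMPTION: the input is well-formed UTF-8; nothing is validated.
    static int sequenceLength(byte leadByte) {
        int b = leadByte & 0xFF;
        if (b < 0x80) return 1;           // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        return 4;                         // 11110xxx
    }

    // Byte offsets at which each code point starts.
    static List<Integer> codePointOffsets(byte[] utf8) {
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < utf8.length) {
            offsets.add(i);
            i += sequenceLength(utf8[i]); // advance by the whole sequence
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] utf8 = "A\u00DF\u6771\uD801\uDC00".getBytes(StandardCharsets.UTF_8);
        System.out.println(codePointOffsets(utf8)); // [0, 1, 3, 6]
    }
}
```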

Mutable: Another difference from String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of the set() methods on it.
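The idea of a reusable, mutable byte holder can be sketched as follows. ReusableUtf8 is a hypothetical class written for illustration with the standard library only. It is not Hadoop's Text and makes no claim about how Text is implemented. It shows one consequence of reuse: after setting a shorter value, the backing array can be longer than the valid data, so only the first getLength() bytes should be read.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration only; this is not Hadoop's Text.
final class ReusableUtf8 {
    private byte[] bytes = new byte[0]; // backing array, reused across set() calls
    private int length;                 // number of valid bytes in the array

    // Replaces the contents, growing the backing array only when needed.
    void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > bytes.length) {
            bytes = new byte[utf8.length];
        }
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    int getLength() { return length; }      // valid bytes
    int capacity() { return bytes.length; } // size of the backing array

    // Decodes only the valid range, never the stale tail.
    @Override
    public String toString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableUtf8 t = new ReusableUtf8();
        t.set("hadoop");
        t.set("pig");
        System.out.println(t);             // pig
        System.out.println(t.getLength()); // 3
        System.out.println(t.capacity());  // 6
    }
}
```

Reusing one object this way avoids allocating a new instance per record, at the cost that callers must not hold on to a value they expect to stay unchanged.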

Resorting to String:

Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String. This is done in the usual way, using the toString() method.

For more details, see Hadoop: The Definitive Guide.
