Difference between Text and String in Hadoop


Question

What is the difference between org.apache.hadoop.io.Text and java.lang.String in the Hadoop framework?

Why couldn't they use String instead of introducing a new Text class?

I investigated the difference and found out it has to do with the encoding format; however I don't understand it yet.

Can someone explain the differences (with examples, if applicable)?

Answer

The binary representation of a Text object is a variable length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves.
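To make the layout concrete, here is a minimal JDK-only sketch of "length, then UTF-8 bytes". It does not use Hadoop; `LengthPrefixedUtf8` and its methods are hypothetical names for illustration. It also simplifies the length field: it assumes the payload is at most 127 bytes, in which case (as far as I understand Hadoop's variable-length integer format) the length occupies a single byte. Longer strings need a multi-byte length, which this sketch does not implement.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Hadoop.
class LengthPrefixedUtf8 {

    // Encodes s as [length][UTF-8 bytes].
    // Simplification: only payloads of at most 127 bytes are handled,
    // where the length is assumed to fit in one byte.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch handles at most 127 bytes");
        }
        byte[] out = new byte[1 + utf8.length];
        out[0] = (byte) utf8.length;                    // byte count, not char count
        System.arraycopy(utf8, 0, out, 1, utf8.length); // the UTF-8 bytes themselves
        return out;
    }

    // Renders a byte array as lowercase hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode("hadoop"))); // 066861646f6f70
        System.out.println(toHex(encode("\u00e9"))); // 02c3a9 (one char, two bytes)
    }
}
```

Note that the prefix counts encoded bytes: the single character é is stored with a length of 2.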

Text is a replacement for the UTF8 class, which was deprecated because it didn’t support strings whose encoding was over 32,767 bytes, and because it used Java’s modified UTF-8.

Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.
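The difference between Java's modified UTF-8 and standard UTF-8 can be seen with the JDK alone. The sketch below does not touch Hadoop's deprecated UTF8 class; it uses `DataOutputStream.writeUTF`, which writes modified UTF-8, and compares it with `StandardCharsets.UTF_8`. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; JDK only, no Hadoop dependency.
class ModifiedVsStandardUtf8 {

    // Payload size in Java's modified UTF-8, as produced by writeUTF.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2; // writeUTF prepends a 2-byte length header
    }

    // Payload size in standard UTF-8.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String nul = "\u0000";                                // U+0000
        String emoji = new String(Character.toChars(0x1F600)); // outside the BMP

        // U+0000: two bytes in modified UTF-8, one in standard UTF-8
        System.out.println(modifiedUtf8Length(nul) + " vs " + standardUtf8Length(nul));     // 2 vs 1
        // Supplementary character: encoded as two 3-byte surrogates vs one 4-byte sequence
        System.out.println(modifiedUtf8Length(emoji) + " vs " + standardUtf8Length(emoji)); // 6 vs 4
    }
}
```

Bytes written in the modified form are not valid standard UTF-8 for these inputs, which is the interoperability problem the answer refers to.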

In brief, Text differs from String in the following ways:

Indexing: Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, or the Java char code unit (as it is for String).

For instance, charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char.

Iteration: Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can’t just increment the index.
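The indexing and iteration points can be illustrated without Hadoop by walking a plain UTF-8 byte array. This is only a sketch of the idea behind byte-offset indexing, not Text's implementation: `Utf8ByteOffsetWalk` and its methods are hypothetical, and the code assumes well-formed UTF-8 input.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; JDK only, assumes well-formed UTF-8.
class Utf8ByteOffsetWalk {

    // Number of bytes in the UTF-8 sequence that begins with leadByte.
    static int sequenceLength(byte leadByte) {
        int b = leadByte & 0xff;
        if (b < 0x80) return 1;           // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }

    // Decodes the code point that starts at the given byte offset.
    static int codePointAt(byte[] utf8, int offset) {
        int len = sequenceLength(utf8[offset]);
        return new String(utf8, offset, len, StandardCharsets.UTF_8).codePointAt(0);
    }

    // Byte offsets at which each code point starts. The step size varies,
    // which is why the index cannot simply be incremented by one.
    static List<Integer> codePointOffsets(byte[] utf8) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < utf8.length; i += sequenceLength(utf8[i])) {
            offsets.add(i);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Four code points that take 1, 2, 3 and 4 bytes in UTF-8.
        String s = "a\u00e9\u20ac" + new String(Character.toChars(0x1F600));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(s.length());             // 5 (Java char code units)
        System.out.println(utf8.length);            // 10 (UTF-8 bytes)
        System.out.println(codePointOffsets(utf8)); // [0, 1, 3, 6]

        // Byte offset 6 yields the whole code point; String.charAt(3) yields half of it.
        System.out.println(Integer.toHexString(codePointAt(utf8, 6))); // 1f600
        System.out.println(Integer.toHexString(s.charAt(3)));          // d83d
    }
}
```

The same string therefore has three different "lengths" depending on whether you count code points (4), Java chars (5), or UTF-8 bytes (10).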

Mutable: Another difference from String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of its set() methods.
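What "mutable and reusable" means in practice can be sketched with a small stand-in class. `MiniText` below is hypothetical and is not Hadoop's Text; it only mimics the idea of a byte-backed object whose contents are replaced in place by `set()`, assuming a backing array that is kept when it is already large enough.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a mutable, byte-backed text holder; not Hadoop's Text.
class MiniText {
    private byte[] bytes = new byte[0]; // backing array, may be longer than length
    private int length;                 // number of valid bytes in the array

    MiniText(String s) {
        set(s);
    }

    // Replaces the contents in place. The backing array is reused when it is
    // large enough, so no new holder object is needed per value.
    void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > bytes.length) {
            bytes = new byte[utf8.length];
        }
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    int getLength() {
        return length;
    }

    int capacity() {
        return bytes.length;
    }

    // Decodes only the valid prefix of the backing array.
    @Override
    public String toString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MiniText t = new MiniText("hadoop");
        t.set("pig");                        // same object, new contents
        System.out.println(t);               // pig
        System.out.println(t.getLength());   // 3
        System.out.println(t.capacity());    // 6 (array was not shrunk)
    }
}
```

The practical consequence of this design is that a reused instance must not be held onto as if it were an immutable value: if you need to keep the contents, copy them out (for example as a String) before the next `set()` call. As I understand it, the real Text behaves similarly in that the array returned by getBytes() is only valid up to getLength().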

Resorting to String:

Text doesn’t have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String. This is done in the usual way, by calling its toString() method.

For more details, see Hadoop: The Definitive Guide.
