Why should a Writable datatype be Mutable?


Problem Description


Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?

Thanks & Regards, Raja

Solution

You can't choose; these datatypes have to be mutable.

The reason is the serialization mechanism. Let's look at the code:

// version 1.x MapRunner#run()
K1 key = input.createKey();
V1 value = input.createValue();

while (input.next(key, value)) {
  // map pair to output
  mapper.map(key, value, output, reporter);
  ...

So we are reusing the same key/value instances over and over again. Why? I don't know the design decisions from back then, but I assume it was to reduce the number of garbage objects. Note that Hadoop is quite old, and back then the garbage collectors were not as efficient as they are today; however, even today it makes a big difference in runtime if you map billions of objects and immediately throw them away as garbage.
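To make that reuse concrete, here is a minimal sketch of a custom Writable (the class name TemperatureWritable and its fields are made up for illustration; the Writable interface and its two methods are the real Hadoop API). Its readFields() overwrites the existing fields in place, which is exactly the in-place mutation that input.next(key, value) depends on:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TemperatureWritable implements Writable {
  private int stationId;   // mutable on purpose: the same instance is reused per record
  private float celsius;

  public TemperatureWritable() {}  // no-arg constructor required by the framework

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(stationId);
    out.writeFloat(celsius);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // deserialization mutates THIS instance instead of allocating a new one
    stationId = in.readInt();
    celsius = in.readFloat();
  }
}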

The real reason why you can't make the Writable types truly immutable is that you can't declare their fields as final. Let's take a simple example with IntWritable:

public class IntWritable implements WritableComparable {
  private int value;

  public IntWritable() {}

  public IntWritable(int value) { set(value); }
...

If you made it immutable, it would certainly no longer work with the serialization process, because you would need to declare value as final. That can't work: the keys and values are instantiated at runtime via reflection, which requires a default constructor, so the InputFormat has no way to guess the parameters that would be needed to fill the final fields. The whole concept of reusing instances therefore contradicts the concept of immutability.
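To illustrate that reflection step, the deserialization path looks roughly like the following sketch. The helper class and method here are hypothetical simplifications; only ReflectionUtils.newInstance and Writable#readFields are real Hadoop APIs. Note that it only works because there is a no-arg constructor plus a mutating readFields():

import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class WritableDeserializer {
  // simplified sketch of how a key/value instance is created and then refilled
  public static <T extends Writable> T createAndRead(
      Class<T> clazz, Configuration conf, DataInput in) throws IOException {
    // reflection can only call the no-arg constructor -- a final field
    // could never be assigned here
    T instance = ReflectionUtils.newInstance(clazz, conf);
    instance.readFields(in);  // state is injected by mutation, not construction
    return instance;
  }
}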

However, you should ask yourself what kind of benefit an immutable key/value would actually have in Map/Reduce. In Joshua Bloch's Effective Java (Item 15) he states that immutable classes are easier to design, implement, and use. And he is right, because Hadoop's reducer is the worst possible example of mutability:

void reduce(IntWritable key, Iterable<Text> values, Context context) ...

Every value in the iterable refers to the same shared object. Thus many people are confused when they buffer the values into a normal collection and then wonder why it always retains the same value.
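The usual workaround, when you really do need to buffer the values, is to copy each element before storing it. A minimal sketch follows (the reducer class name is made up for illustration; Text's copy constructor is a real Hadoop API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BufferingReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<Text> buffered = new ArrayList<Text>();
    for (Text value : values) {
      // wrong: buffered.add(value) -- every list element would end up
      // pointing at the single instance the framework keeps overwriting
      buffered.add(new Text(value));  // the copy constructor takes a snapshot
    }
    // ... work with the independent copies in 'buffered'
  }
}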

In the end it boils down to a trade-off between performance (CPU and memory: imagine billions of value objects for a single key having to reside in RAM) and simplicity.

