Is there a way to change the replication factor of RDDs in Spark?


Question

From what I understand, there are multiple copies of data in RDDs in the cluster, so that in case of failure of a node, the program can recover. However, in cases where chance of failure is negligible, it would be costly memory-wise to have multiple copies of data in the RDDs. So, my question is, is there a parameter in Spark, which can be used to reduce the replication factor of the RDDs?

Answer

First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs and not all of them are meant to be reused. You have to call .persist() or .cache() on them.

You set the storage level with which an RDD is persisted by calling, for example, myRDD.persist(StorageLevel.MEMORY_AND_DISK); .cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY).
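As a minimal sketch (the RDD contents, the app name, and the local master URL here are just for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(
      new SparkConf().setAppName("persist-demo").setMaster("local[*]"))

    val myRDD = sc.parallelize(1 to 1000000)

    // Pick a storage level explicitly; nothing is materialized until an action runs.
    myRDD.persist(StorageLevel.MEMORY_AND_DISK)

    // .cache() is the same as .persist(StorageLevel.MEMORY_ONLY)
    val cached = sc.parallelize(1 to 100).cache()

    myRDD.count()   // the first action populates the cache
    cached.count()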

The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it usually differs if you are creating a DStream (refer to the DStream constructor API docs). If you're using Python, the default is StorageLevel.MEMORY_ONLY_SER.

The documentation details a number of storage levels and what they mean, but fundamentally they are a configuration shorthand pointing Spark to an instance of the StorageLevel class. You can thus define your own level with a replication factor of up to 40.
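For instance, if you wanted three in-memory copies of each partition, a sketch using the StorageLevel factory method (the value name memoryThreeCopies is made up here):

    import org.apache.spark.storage.StorageLevel

    // StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
    val memoryThreeCopies = StorageLevel(
      useDisk = false,
      useMemory = true,
      useOffHeap = false,
      deserialized = true,
      replication = 3)

    myRDD.persist(memoryThreeCopies)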

Note that among the various predefined storage levels, some keep only a single copy of the RDD. In fact, that's true of every level whose name is not suffixed with _2 (NONE excepted):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That's one copy per medium they employ; so if you want a single copy overall, you have to choose a single-medium storage level.
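To make the difference concrete, a small sketch comparing a single-copy level with its _2 counterpart, reusing the sc from the sketch above (getStorageLevel reports the replication factor actually assigned):

    import org.apache.spark.storage.StorageLevel

    val singleCopy = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)
    val twoCopies  = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK_2)

    println(singleCopy.getStorageLevel.replication)  // 1: a single copy overall
    println(twoCopies.getStorageLevel.replication)   // 2: replicated on two nodes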
