Spark save files distributedly


Question

According to the Spark documentation:

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
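A minimal sketch of that laziness in Scala, assuming a spark-shell session where a SparkContext named sc already exists and the input path is a placeholder:

// Transformations only record a lineage; no file is read and nothing is computed here.
val lines = sc.textFile("hdfs:///input/events.txt")
val lengths = lines.map(_.length)
val longLines = lengths.filter(_ > 80)

// Only the action triggers the actual computation and returns a value to the driver.
val howMany = longLines.count()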

I am currently working on a large dataset that, once processed, outputs even a bigger amount of data, which needs to be stored in text files, as done with the command saveAsTextFile(path).

So far I have been using this method; however, since it is an action (as stated above) and not a transformation, Spark needs to send data from every partition to the driver node, thus slowing down the process of saving quite a bit.

I was wondering if any distributed file saving method (similar to saveAsTextFile()) exists on Spark, enabling each executor to store its own partition by itself.

Answer

I think you're misinterpreting what it means to send a result to the driver. saveAsTextFile does not send the data back to the driver. Rather, it sends the result of the save back to the driver once it's complete. That is, saveAsTextFile is distributed. The only case where it's not distributed is if you only have a single partition, or you've coalesced your RDD back to a single partition before calling saveAsTextFile.
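As a small sketch of that single-partition caveat (same assumptions: a spark-shell SparkContext sc and placeholder paths):

val rdd = sc.textFile("hdfs:///input/big-dataset")

// Distributed save: each executor writes the partitions it holds.
rdd.saveAsTextFile("hdfs:///output/distributed")

// Not distributed: all data is funnelled into one partition first,
// so a single task writes a single output file.
rdd.coalesce(1).saveAsTextFile("hdfs:///output/single-file")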

What that documentation is referring to is sending the result of saveAsTextFile (or any other "action") back to the driver. If you call collect() then it will indeed send the data to the driver, but saveAsTextFile only sends a succeed/failed message back to the driver once complete. The save itself is still done on many nodes in the cluster, which is why you'll end up with many files - one per partition.
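A hedged sketch of that contrast, again with placeholder paths and the spark-shell sc:

val processed = sc.textFile("hdfs:///input/big-dataset").map(_.toUpperCase)

// collect() really does ship every row to the driver; this is the pattern
// the documentation's warning is about.
val onDriver = processed.collect()

// saveAsTextFile writes in parallel on the executors; the driver only learns
// whether the job succeeded. The output directory ends up with
// part-00000, part-00001, ... one file per partition.
processed.saveAsTextFile("hdfs:///output/result")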

IO is always expensive. But sometimes it can seem as if saveAsTextFile is even more expensive precisely because of the lazy behavior described in that excerpt. Essentially, when saveAsTextFile is called, Spark may perform many or all of the prior operations on its way to being saved. That is what is meant by laziness.
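For example (same assumptions as above), nothing in the pipeline below runs until the save is requested, so the save job carries the cost of every earlier step:

val parsed = sc.textFile("hdfs:///input/big-dataset")
  .map(_.split(","))       // lazy: just recorded in the lineage
  .filter(_.length > 1)    // lazy as well

// The whole lineage above executes here, inside the save job.
parsed.map(_.mkString("\t")).saveAsTextFile("hdfs:///output/parsed")

// If the same RDD feeds several actions, caching it avoids recomputing
// the lineage each time (an optional, situational optimisation):
// parsed.cache()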

If you haven't already done so, setting up the Spark UI may give you better insight into what is happening to the data on its way to a save.
