How to save a file on the cluster


Problem description



I'm connected to the cluster using ssh and I send the program to the cluster using

spark-submit --master yarn myProgram.py

I want to save the result in a text file and I tried using the following lines:

counts.write.json("hdfs://home/myDir/text_file.txt")
counts.write.csv("hdfs://home/myDir/text_file.csv")

However, neither of them works. The program finishes, but I cannot find the text file in myDir. Do you have any idea how I can do this?

Also, is there a way to write directly to my local machine?

EDIT: I found out that the home directory doesn't exist, so now I save the result as:

counts.write.json("hdfs:///user/username/text_file.txt")

But this creates a directory named text_file.txt, and inside it there are many files with partial results. I want one file with the final result inside. Any ideas how I can do this?

Solution

Spark will save the results in multiple files since the computation is distributed. Therefore writing:

counts.write.csv("hdfs://home/myDir/text_file.csv")

means that the data on each partition is saved as a separate file inside a folder named text_file.csv. If you want the data saved as a single file, use coalesce(1) first:

counts.coalesce(1).write.csv("hdfs://home/myDir/text_file.csv")

This will put all the data into a single partition and the number of saved files will thus be 1. However, this could be a bad idea if you have a lot of data. If the data is very small then using collect() is an alternative. This will put all data onto the driver machine as an array, which can then be saved as a single file.
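As a minimal sketch of the collect() alternative: once the data is small enough to fit on the driver, you can pull it down and write one local file yourself with plain Python. The literal `rows` list below stands in for whatever `counts.collect()` would return (assumed here to be (word, count) pairs; your actual schema may differ), so the snippet runs without a Spark cluster.

```python
import csv
import json

# Stand-in for: rows = counts.collect()
# (collect() brings every row back to the driver machine as a list.)
rows = [("spark", 3), ("hdfs", 1)]

# Write one CSV file on the driver's local disk.
with open("text_file.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Or one JSON-lines file, mirroring the one-record-per-line
# format that counts.write.json produces.
with open("text_file.json", "w") as f:
    for word, n in rows:
        f.write(json.dumps({"word": word, "count": n}) + "\n")
```

Note this writes to the driver's local filesystem, not HDFS, which also answers the "write directly to my local machine" question when the job is submitted from that machine; for large data, prefer coalesce(1) (or merge the part-files afterwards) instead of collect().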

