Writing files to local system with Spark in Cluster mode


Problem Description

I know this is a weird way of using Spark, but I'm trying to save a dataframe to the local file system (not HDFS) using Spark even though I'm in cluster mode. I know I could use client mode, but I do want to run in cluster mode and don't care which node (out of 3) the application is going to run on as the driver. The code below is pseudo code for what I'm trying to do.

// create dataframe
val df = Seq(Foo("John", "Doe"), Foo("Jane", "Doe")).toDF()
// save it to the local file system with an explicit 'file://' scheme, because paths default to hdfs://
df.coalesce(1).rdd.saveAsTextFile("file:///path/to/file")
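For reference, a self-contained version of that snippet also needs a top-level case class and the SQLContext implicits for .toDF() to compile on Spark 1.6. A minimal sketch (the object and field names here are my own, not from the original code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// case classes used with .toDF() must be defined outside the method that uses them
case class Foo(firstName: String, lastName: String)

object LocalWriteSample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LocalWriteSample"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings .toDF() into scope

    val df = Seq(Foo("John", "Doe"), Foo("Jane", "Doe")).toDF()
    // 'file://' forces the local file system of whichever node runs each task;
    // without a scheme the path resolves against the default fs (hdfs://)
    df.coalesce(1).rdd.saveAsTextFile("file:///path/to/file")

    sc.stop()
  }
}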

And this is how I'm submitting the Spark application.

spark-submit --class sample.HBaseSparkRSample --master yarn-cluster hbase-spark-r-sample-assembly-1.0.jar
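(As an aside, the same submission can also be written with the split master/deploy-mode flags, which is the form later Spark releases prefer; both work on 1.6:)

spark-submit --class sample.HBaseSparkRSample --master yarn --deploy-mode cluster hbase-spark-r-sample-assembly-1.0.jar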

This works fine if I'm in local mode but doesn't in yarn-cluster mode.

For example, the above code fails with java.io.IOException: Mkdirs failed to create file.

I've changed the df.coalesce(1) part to df.collect and attempted to save the file using plain Scala, but it ended up with Permission denied.
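That attempt presumably looked something like the sketch below, continuing from the df above (the exact path and writer are my assumptions; only the collect-then-write pattern is from the question):

import java.io.{File, PrintWriter}

// collect the rows to the driver, then write them with plain Scala/Java I/O;
// in yarn-cluster mode this runs as the container's OS user (e.g. yarn)
val rows = df.collect()
val writer = new PrintWriter(new File("/home/foo/work/rhbase/r/input/input.csv"))
try {
  rows.foreach(row => writer.println(row.mkString(",")))
} finally {
  writer.close()
}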

I've also tried:

  • spark-submit with root user
  • chowned yarn:yarn, yarn:hadoop, spark:spark
  • gave chmod 777 to related directories

but no luck.

I'm assuming this has something to do with clusters, drivers and executors, and the user who's trying to write to the local file system, but I'm pretty much stuck trying to solve this problem by myself.

I'm using:

  • Spark: 1.6.0-cdh5.8.2
  • Scala: 2.10.5
  • Hadoop: 2.6.0-cdh5.8.2

Any support is welcome and thanks in advance.

Some articles I've tried:

  • "Spark saveAsTextFile() results in Mkdirs failed to create for half of the directory" -> Tried changing users but nothing changed
  • "Failed to save RDD as text file to local file system" -> chmod didn't help me

Edited (2016/11/25)

This is the Exception I get.

java.io.IOException: Mkdirs failed to create file:/home/foo/work/rhbase/r/input/input.csv/_temporary/0/_temporary/attempt_201611242024_0000_m_000000_0 (exists=false, cwd=file:/yarn/nm/usercache/foo/appcache/application_1478068613528_0143/container_e87_1478068613528_0143_01_000001)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:920)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:813)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/11/24 20:24:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: Mkdirs failed to create file:/home/foo/work/rhbase/r/input/input.csv/_temporary/0/_temporary/attempt_201611242024_0000_m_000000_0 (exists=false, cwd=file:/yarn/nm/usercache/foo/appcache/application_1478068613528_0143/container_e87_1478068613528_0143_01_000001)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:920)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:813)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Solution

I'm going to answer my own question because, in the end, none of the answers seemed to solve my problem. Nonetheless, thanks for all the answers and for pointing me to alternatives I could check.

I think @Ricardo was the closest in mentioning the user of the Spark application. I checked whoami with Process("whoami") and the user was yarn. The problem was probably that I tried to output to /home/foo/work/rhbase/r/input/input.csv: although /home/foo/work/rhbase was owned by yarn:yarn, /home/foo was owned by foo:foo. I haven't checked in detail, but this may have been the cause of the permission problem.

When I ran pwd in my Spark application with Process("pwd"), the output was /yarn/path/to/somewhere. So I decided to output my file to /yarn/input.csv instead, and it succeeded even in cluster mode.
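For anyone debugging the same thing, both checks above can be done inline with scala.sys.process; a small sketch:

import scala.sys.process._

// these run on whichever node YARN placed the driver in cluster mode
val user = Process("whoami").!!.trim // was "yarn" in my case
val cwd  = Process("pwd").!!.trim    // was a path under /yarn
println(s"driver runs as user=$user with cwd=$cwd")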

I can probably conclude that this was just a simple permission issue. Any further solutions would be welcome, but for now, this is how I solved the problem.
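In code terms, the fix was nothing more than pointing the save at a local directory the container's user can write to:

// /yarn was writable by the container's OS user (yarn) on my nodes;
// pick any local directory that user owns
df.coalesce(1).rdd.saveAsTextFile("file:///yarn/input.csv")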
