Create and display a Spark DataFrame from a simple JSON file

Problem Description

The following simple json DataFrame test works fine when running Spark in local mode. Here is the Scala snippet, but I've successfully got the same thing working in Java and Python as well:

sparkContext.addFile(jsonPath)
val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
val dataFrame = sqlContext.jsonFile(jsonPath)
dataFrame.show()

I made sure the jsonPath works from both the driver side and the worker side, and I'm calling addFile... The json file is very trivial:

[{"age":21,"name":"abc"},{"age":30,"name":"def"},{"age":45,"name":"ghi"}]

The exact same code fails when I switch out of local mode and use a separate Spark server with a single master/worker. I've tried this same test in Scala, Java, and Python to try to find some combination that works. They all get basically the same error. The following error is from the Scala driver program but the Java/Python error messages are nearly identical:

15/04/17 18:05:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
    at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
    at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)

This is very frustrating. I'm basically trying to get code snippets from the official docs to work.

UPDATE: Thank you, Paul, for the in-depth response. I am getting errors when doing the same steps. FYI, earlier I was using a driver program, hence the name sparkContext rather than the shell default name of sc. Here is an abbreviated snippet with excess logging removed:

➜  spark-1.3.0  ./bin/spark-shell --master spark://172.28.128.3:7077
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val dataFrame = sqlContext.jsonFile("/private/var/userspark/test.json")
15/04/20 18:01:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
    at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
    at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
    at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
    (...)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.0.2.15): java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)

Solution

While I can get your simple example working, I agree that spark can be frustrating...

Here I have spark 1.3.0, built from source with openjdk 8.

Using your file with spark-shell or spark-submit fails for different reasons; possibly the examples/docs are obsolete compared to the released code and need to be adjusted slightly.

For instance, in spark-shell the Spark context is already available as sc, not as sparkContext, and there is a similarly predefined sqlContext. spark-shell emits INFO messages announcing the creation of these contexts.
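
For comparison, here is a minimal standalone-driver sketch of the same job (the object name and master URL are assumptions for illustration); in spark-shell you would skip the context creation entirely, since sc and sqlContext already exist:

import org.apache.spark.{SparkConf, SparkContext}

object JsonShow {
  def main(args: Array[String]): Unit = {
    // In spark-shell these two contexts are pre-created as `sc` and `sqlContext`.
    val conf = new SparkConf().setAppName("json-show").setMaster("spark://192.168.1.10:7077")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val dataFrame = sqlContext.jsonFile("/data/so1.json")
    dataFrame.show()

    sc.stop()
  }
}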

For spark-submit, I am getting some kind of jar error. That may be a local issue.

Anyway, it runs fine if I shorten it. It also doesn't seem to matter, for the purpose of executing this short example, whether the json file has one object per line or not. For a future test it might be useful to generate a large example and determine if it runs in parallel across the cores and if it needs one object per line (with no commas or top bracket) to accomplish this.
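
As a side note for that future test, here is one way such a file could be generated (the output path, record count, and field values are my own assumptions, not from the post):

import java.io.PrintWriter
import scala.util.Random

// Illustrative generator for a larger JSON Lines style test file:
// one object per line, no commas between objects, no enclosing brackets.
object MakeBigJson {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("/data/so-big.json")   // assumed path
    try {
      for (i <- 1 to 1000000) {
        val age = 18 + Random.nextInt(60)
        out.println(s"""{"age":$age,"name":"name$i"}""")
      }
    } finally {
      out.close()
    }
  }
}

After loading it with sqlContext.jsonFile, something like dataFrame.rdd.partitions.length would give a rough idea of how many parallel read tasks were produced.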

so1-works.sc

val dataFrame = sqlContext.jsonFile("/data/so1.json")
dataFrame.show()

Output, suppressing INFO, etc. messages...

paul@ki6cq:~/spark/spark-1.3.0$ ./bin/spark-shell --master spark://192.168.1.10:7077 <./so1-works.sc 2>/dev/null
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.8.0_40-internal)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> 

scala> val dataFrame = sqlContext.jsonFile("/data/so1.json")
dataFrame: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> dataFrame.show()
age name
21  abc 
30  def 
45  ghi 

scala> Stopping spark context.
paul@ki6cq:~/spark/spark-1.3.0$ paul@ki6cq:~/spark/spark-1.3.0$ 

Odd Note: Afterwards, I have to execute reset to get my linux terminal back to normal.

OK, so first, try shortening the example as I have done.

If that doesn't fix it, you can try duplicating my environment.

This might be straightforward as I use docker for the master and worker and have posted the images to the public dockerhub.

Note to future readers: My public dockerhub images are not official images for spark and are subject to change or removal.

You need two computers (one running Linux or a docker-compatible OS to host the master and worker within docker containers, the other also preferably running Linux or something with a spark-1.3.0 build) behind a home firewall router device (DLink, Netgear, etc...). I assume the local network is 192.168.1.*, that 192.168.1.10 and .11 are free, and that the router will route properly, or you know how to make it route properly. You can change these addresses in the run script below.

If you just have one computer, then due to particulars of bridging, the networking methods I have used here will probably not work properly to communicate back to the host.
It can be made to work, but that is a bit more than I'd like to add to an already long posting.

On one Linux computer, install docker, the pipework utility, and these shell scripts (adjust the memory granted to Spark if needed; editing out the extra workers doesn't seem to be necessary):

./run-docker-spark

#!/bin/bash
sudo -v
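# sudo -v above caches credentials so the pipework calls below don't prompt mid-run.
# Each container gets /etc/hosts entries for master and spark1..spark4 via --add-host,
# exposes the full port range (Spark uses many ports), and is then attached to the LAN
# by pipework with a static address (192.168.1.10 for the master, .11 for the worker).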
MASTER=$(docker run --name="master" -h master --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 -d drpaulbrewer/spark-master:latest)
sudo pipework eth0 $MASTER 192.168.1.10/24@192.168.1.1
SPARK1=$(docker run --name="spark1" -h spark1 --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env mem=10G --env master=spark://192.168.1.10:7077 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 $SPARK1 192.168.1.11/24@192.168.1.1

./stop-docker-spark

#!/bin/bash
docker kill master spark1
docker rm master spark1

The other Linux computer will be your user computer, and needs a build of spark-1.3.0. Make a /data directory on both computers and install the json file there. Then run ./run-docker-spark once, on the computer that acts as the combined host for the containers (like VMs) that will hold the master and worker. To stop the spark system, use the stop script. If you reboot, or there's a bad error, you need to run the stop script before the run script will work again.

Check to see that master and worker have paired at http://192.168.1.10:8080

If so, then you should be good to try the spark-shell command line up top.

You don't need these dockerfiles, because the builds are posted on the public dockerhub and downloads are automated in docker run. But here they are in case you want to see how things are built, the JDKs, maven command, etc.

I start with a common Dockerfile I place in a dir called spark-roasted-elephant, since this is a non-hadoop build and the O'Reilly book for hadoop had an elephant on it. You need a spark-1.3.0 source tarball from the spark website to drop into the directory with the Dockerfile. This Dockerfile probably doesn't expose enough ports (spark is very promiscuous about its use of ports, whereas docker unfortunately is designed to contain and document the use of ports), and the expose is overridden in the shell scripts that run the master and worker. This will cause some unhappiness if you ask docker to list what is running, because that includes a port list.

paul@home:/Z/docker$ cat ./spark-roasted-elephant/Dockerfile
# Copyright 2015 Paul Brewer http://eaftc.com
# License: MIT
# this docker file builds a non-hadoop version of spark for standalone experimentation
# thanks to article at http://mbonaci.github.io/mbo-spark/ for tips
FROM ubuntu:15.04
MAINTAINER drpaulbrewer@eaftc.com
RUN adduser --disabled-password --home /spark spark
WORKDIR /spark
ADD spark-1.3.0.tgz /spark/ 
WORKDIR /spark/spark-1.3.0
RUN sed -e 's/archive.ubuntu.com/www.gtlib.gatech.edu\/pub/' /etc/apt/sources.list > /tmp/sources.list && mv /tmp/sources.list /etc/apt/sources.list
RUN apt-get update && apt-get --yes upgrade \
    && apt-get --yes install sed nano curl wget openjdk-8-jdk scala \
    && echo "JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >>/etc/environment \
    && export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" \
    && ./build/mvn -Phive -Phive-thriftserver -DskipTests clean package \
    && chown -R spark:spark /spark \
    && mkdir /var/run/sshd
EXPOSE 2222 4040 6066 7077 7777 8080 8081 

The master is built from a directory ./spark-master using a dockerfile and a shell script that gets included in the container. Here's that dockerfile and shell script.

paul@home:/Z/docker$ cat ./spark-master/Dockerfile
FROM drpaulbrewer/spark-roasted-elephant:latest
MAINTAINER drpaulbrewer@eaftc.com
ADD my-spark-master.sh /spark/
USER spark
CMD /spark/my-spark-master.sh

paul@home:/Z/docker$ cat ./spark-master/my-spark-master.sh
#!/bin/bash -e
cd /spark/spark-1.3.0
# set SPARK_MASTER_IP to a net interface address, e.g. 192.168.1.10
export SPARK_MASTER_IP
./sbin/start-master.sh 
sleep 10000d

And for the worker:

paul@home:/Z/docker$ cat ./spark-worker/Dockerfile
FROM drpaulbrewer/spark-roasted-elephant:latest
MAINTAINER drpaulbrewer@eaftc.com
ADD my-spark-worker.sh /spark/
CMD /spark/my-spark-worker.sh
paul@home:/Z/docker$ cat ./spark-worker/my-spark-worker.sh
#!/bin/bash -e
cd /spark/spark-1.3.0
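# wait a bit before starting (presumably so the master is up before this worker tries to register)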
sleep 10
# dont use ./sbin/start-slave.sh it wont take numeric URL
mkdir -p /Z/data
mkdir -p /user/hive/warehouse
chown -R spark:spark /user
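# launch the Worker class as the unprivileged spark user, giving it $mem of memory and the $master URL set in run-docker-spark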
su -c "cd /spark/spark-1.3.0 && ./bin/spark-class org.apache.spark.deploy.worker.Worker --memory $mem $master" spark

Although by now this post has turned into the answer to "how do I make Dockerfiles for spark?", it isn't intended to be. These Dockerfiles are experimental for me; I do not use them in production and do not vouch for their quality. I disliked a well-regarded source for spark on Docker because they engaged in the lazy practice of chaining a bunch of containers together, and it was huge and took forever to download. Here there are far fewer layers and the download is smaller. This is posted not so much as a docker example but so you can determine what is different in your own environment.
