How to run multiple instances of Spark 2.0 at once (in multiple Jupyter Notebooks)?


Question

I have a script which conveniently allows me to use Spark in a Jupyter Notebook. This is great, except when I run Spark commands in a second notebook (for instance, to test out some scratch work).
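The setup script itself isn't shown in the question; for context, here is a minimal sketch of the kind of notebook setup it implies (the findspark usage, app name, and JSON path are assumptions, not the asker's actual script):

# Hypothetical sketch of a notebook Spark setup; findspark, the app name,
# and the JSON path are assumptions, not the asker's actual script.
import findspark
findspark.init()  # point Python at the local Spark installation

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("notebook-1")
         .enableHiveSupport()  # Hive support is what brings in the metastore
         .getOrCreate())

# The kind of call behind "o31.json" in the error message below:
df = spark.read.json("example.json")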

I get a very long error message, the key parts of which seem to be:

Py4JJavaError: An error occurred while calling o31.json. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

. . .

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /metastore_db

The problem seems to be that I can only run one instance of Spark at a time.

How can I set up Spark to run in multiple notebooks at once?

Answer

By default Spark runs on top of Hive and Hadoop, and stores its instructions for database transformations in Derby, a lightweight database system. Derby's embedded mode only allows one process to open the database at a time, so when you start a second notebook and start running Spark commands, it crashes.
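Concretely, each notebook kernel is a separate JVM, and the embedded Derby metastore takes an exclusive lock on the metastore_db directory. A sketch of the failure mode (the file names are just examples):

# Notebook 1: boots the embedded Derby metastore and locks ./metastore_db
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.read.json("a.json")  # works

# Notebook 2 (a separate kernel, i.e. a separate JVM): tries to boot the
# same ./metastore_db and fails with Py4JJavaError / ERROR XSDB6
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.read.json("b.json")  # crashes here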

To get around this you can connect Spark's Hive installation to Postgres instead of Derby.

Install Postgres with Homebrew (brew install postgresql), if you do not have it installed already.

Then download postgresql-9.4.1212.jar (assuming you are running Java 1.8, a.k.a. Java 8) from https://jdbc.postgresql.org/download.html

Move this .jar file to the /libexec/jars/ directory for your Spark installation.

For example: /usr/local/Cellar/apache-spark/2.0.1/

(on Mac you can find where Spark is installed by typing brew info apache-spark in the command line)

Next create hive-site.xml in the /libexec/conf directory for your Spark installation.

For example: /usr/local/Cellar/apache-spark/2.0.1/libexec/conf

This can be done through a text editor - just save the file with a '.xml' extension.

hive-site.xml should contain the following text:

<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>

</configuration>

'hive' and 'mypassword' can be replaced with whatever makes sense to you - but they must match the user and password you create in the next step.

Finally, create the user and password in Postgres: in the command line, run the following commands -

psql
CREATE USER hive;
ALTER ROLE hive WITH PASSWORD 'mypassword';
CREATE DATABASE hive_metastore;
GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;
\q

That's it, you're done. Spark should now run in multiple Jupyter Notebooks simultaneously.
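As a quick check (a sketch, not part of the original answer), run something like this in two notebooks at the same time; with the Postgres-backed metastore both sessions should come up instead of the second one failing with XSDB6:

# Run in each notebook; both should now reach the shared Postgres metastore.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SHOW DATABASES").show()  # touches the metastore, which used to trigger XSDB6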
