Best Practice to launch Spark Applications via Web Application?


    I want to expose my Spark applications to users through a web application.

    Basically, the user can decide which action he wants to run and enter a few variables, which need to be passed to the Spark application. For example: the user enters a few fields and then clicks a button which does the following: "run sparkApp1 with parameters min_x, max_x, min_y, max_y".

    The Spark application should be launched with the parameters given by the user. After it finishes, the web application might need to retrieve the results (from HDFS or MongoDB) and display them to the user. While the job is running, the web application should display the status of the Spark application.

    My question:

    • How can the web application launch the Spark Application? It might be able to launch it from the command line under the hood but there might be a better way to do this.
    • How can the web application access the current status of the Spark Application? Is fetching the status from the Spark WebUI's REST API the way to go?

    I'm running a cluster of Spark 1.6.1 with YARN/Mesos (not sure yet) and MongoDB.

    Solution

    Very basic answer:

    Basically, you can use the SparkLauncher class to launch Spark applications and add listeners to watch their progress.

    However, you may be interested in the Livy server, which is a RESTful server for Spark jobs. As far as I know, Zeppelin uses Livy to submit jobs and retrieve their status.

    You can also use the Spark REST interface to check the state; the information will then be more precise. There is an example of how to submit a job via the REST API in the extended answer below.

    You've got three options, and the answer is: check for yourself ;) It depends heavily on your project and requirements. The two main options:

    • SparkLauncher + Spark REST interface
    • Livy server

    should both work for you; you just have to check which is easier and a better fit for your project.

    Extended answer

    You can use Spark from your application in different ways, depending on what you need and what you prefer.

    SparkLauncher

    SparkLauncher is a class from the spark-launcher artifact. It is used to launch already-prepared Spark jobs, just like spark-submit does.

    Typical usage is:

    1) Build the project containing your Spark job and copy the JAR file to all nodes.
    2) From your client application, i.e. the web application, create a SparkLauncher that points to the prepared JAR file:

    SparkAppHandle handle = new SparkLauncher()
        .setSparkHome(SPARK_HOME)
        .setJavaHome(JAVA_HOME)
        .setAppResource(pathToJARFile)
        .setMainClass(MainClassFromJarWithJob)
        .setMaster("MasterAddress
        .startApplication();
        // or: .launch().waitFor()
    

    startApplication creates a SparkAppHandle, which allows you to add listeners and stop the application. It also provides getAppId.
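
    The snippet below is a minimal sketch (not from the original answer) of attaching such a listener with the Spark 1.6+ launcher API; the Spark home, JAR path, main class and master URL are placeholder values:

    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchFromWebApp {
        public static void main(String[] args) throws Exception {
            SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")                    // placeholder
                .setAppResource("/jobs/spark-job-1.0.jar")     // placeholder JAR path
                .setMainClass("spark.ExampleJobInPreparedJar")
                .setMaster("spark://spark-cluster-ip:7077")    // placeholder master URL
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        // fired on transitions such as SUBMITTED, RUNNING, FINISHED, FAILED
                        System.out.println("State changed: " + h.getState());
                    }
                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        // getAppId() becomes available once the application has registered
                        System.out.println("App id: " + h.getAppId());
                    }
                });
            // In a web application the JVM stays alive, so the listener keeps receiving updates.
            // handle.stop() asks the job to stop gracefully; handle.kill() forces it.
        }
    }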

    SparkLauncher should be used together with the Spark REST API. You can query http://driverNode:4040/api/v1/applications/*ResultFromGetAppId*/jobs and you will get information about the current status of the application.
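
    As a minimal sketch (not from the original answer), the web application could poll that endpoint from Java like this; driverNode is a placeholder host name and the application id is the value returned by getAppId:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SparkStatusClient {
        // Generic helper: GET a URL and return the response body as a String.
        public static String httpGet(String urlString) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
            conn.setRequestMethod("GET");
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }

        public static void main(String[] args) throws Exception {
            String appId = args[0]; // e.g. the value from handle.getAppId()
            // Returns a JSON array describing the application's jobs and their statuses.
            String jobsJson = httpGet("http://driverNode:4040/api/v1/applications/" + appId + "/jobs");
            System.out.println(jobsJson);
        }
    }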

    Spark REST API

    It is also possible to submit Spark jobs directly via a RESTful API. Usage is very similar to SparkLauncher, but it's done in a purely RESTful way.

    Example request (credits to this article):

    curl -X POST http://spark-master-host:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
      "action" : "CreateSubmissionRequest",
      "appArgs" : [ "myAppArgument1" ],
      "appResource" : "hdfs:///filepath/spark-job-1.0.jar",
      "clientSparkVersion" : "1.5.0",
      "environmentVariables" : {
        "SPARK_ENV_LOADED" : "1"
      },
      "mainClass" : "spark.ExampleJobInPreparedJar",
      "sparkProperties" : {
        "spark.jars" : "hdfs:///filepath/spark-job-1.0.jar",
        "spark.driver.supervise" : "false",
        "spark.app.name" : "ExampleJobInPreparedJar",
        "spark.eventLog.enabled": "true",
        "spark.submit.deployMode" : "cluster",
        "spark.master" : "spark://spark-cluster-ip:6066"
      }
    }'
    

    This command submits the job in the ExampleJobInPreparedJar class to the cluster with the given Spark master. The response contains a submissionId field, which is helpful for checking the status of the application - simply call another service: curl http://spark-cluster-ip:6066/v1/submissions/status/submissionIdFromResponse. That's it, nothing more to code.
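
    If you prefer to do that status check from code rather than curl, here is a minimal sketch (an assumption, not from the original answer) reusing the httpGet helper shown in the SparkLauncher section; the host and submission id are placeholders, and the JSON response includes a driverState field (e.g. RUNNING, FINISHED, ERROR):

    // submissionIdFromResponse is the submissionId returned by the create call above.
    String statusJson = SparkStatusClient.httpGet(
        "http://spark-cluster-ip:6066/v1/submissions/status/" + submissionIdFromResponse);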

    Livy REST Server and Spark Job Server

    Livy REST Server and Spark Job Server are RESTful applications which allow you to submit jobs via a RESTful web service. One major difference between those two and Spark's REST interface is that Livy and SJS don't require jobs to be prepared in advance and packaged into a JAR file. You just submit code that will be executed in Spark.

    Usage is very simple. The code below is taken from the Livy repository, with some cuts to improve readability.

    1) Case 1: submitting a job that is located on the local machine

    import java.io.File;
    import java.net.URI;
    // Livy client classes: org.apache.livy.* in Apache Livy releases
    // (older Cloudera releases used com.cloudera.livy.*)
    import org.apache.livy.LivyClient;
    import org.apache.livy.LivyClientBuilder;

    // creating client
    LivyClient client = new LivyClientBuilder()
      .setURI(new URI(livyUrl))
      .build();
    
    try {
      // sending and submitting JAR file
      client.uploadJar(new File(piJar)).get();
      // PiJob is a class that implements Livy's Job
      double pi = client.submit(new PiJob(samples)).get();
    } finally {
      client.stop(true);
    }
    

    2) Case 2: dynamic job creation and execution

    # example in Python. Data contains Scala code that will be executed in Spark.
    import json
    import pprint
    import textwrap

    import requests

    # statements_url and headers are assumed to come from the earlier Livy session setup,
    # e.g. statements_url = session_url + '/statements' and
    #      headers = {'Content-Type': 'application/json'}
    data = {
      'code': textwrap.dedent("""\
        val NUM_SAMPLES = 100000;
        val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
          val x = Math.random();
          val y = Math.random();
          if (x*x + y*y < 1) 1 else 0
        }.reduce(_ + _);
        println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
        """)
    }
    
    r = requests.post(statements_url, data=json.dumps(data), headers=headers)
    pprint.pprint(r.json()) 
    

    As you can see, both pre-compiled jobs and ad-hoc queries to Spark are possible.

    Hydrosphere Mist

    Another Spark as a Service application. Mist is very simple and similar to Livy and Spark Job Server.

    Usage is very similar:

    1) Create the job file:

    import io.hydrosphere.mist.MistJob
    
    object MyCoolMistJob extends MistJob {
        def doStuff(parameters: Map[String, Any]): Map[String, Any] = {
            val rdd = context.parallelize()
            ...
            result.asInstanceOf[Map[String, Any]]
        }
    } 
    

    2) Package the job file into a JAR
    3) Send a request to Mist:

    curl --header "Content-Type: application/json" -X POST http://mist_http_host:mist_http_port/jobs --data '{"path": "/path_to_jar/mist_examples.jar", "className": "SimpleContext$", "parameters": {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]}, "namespace": "foo"}'
    

    One strong point that I can see in Mist is that it has out-of-the-box support for streaming jobs via MQTT.

    Apache Toree

    Apache Toree was created to enable easy interactive analytics for Spark. It doesn't require any JAR to be built. It works via the IPython protocol, but not only Python is supported.

    Currently the documentation focuses on Jupyter notebook support, but there is also a REST-style API.

    Comparison and conclusions

    I've listed a few options:

    1. SparkLauncher
    2. Spark REST API
    3. Livy REST Server and Spark Job Server
    4. Hydrosphere Mist
    5. Apache Toree

    All of them are good for different use cases. I can distinguish a few categories:

    1. Tools that require JAR files with the job: Spark Launcher, Spark REST API
    2. Tools for interactive and pre-packaged jobs: Livy, SJS, Mist
    3. Tools that focus on interactive analytics: Toree (however, there may be some support for pre-packaged jobs; no documentation is published at this moment)

    SparkLauncher is very simple and is part of the Spark project. You write the job configuration in plain code, so it can be easier to build than JSON objects.

    For fully RESTful-style submission, consider the Spark REST API, Livy, SJS and Mist. Three of them are stable projects with some production use cases. The REST API also requires jobs to be pre-packaged, while Livy and SJS don't. However, remember that the Spark REST API is included by default in every Spark distribution, while Livy/SJS are not. I don't know much about Mist, but - after a while - it should be a very good tool for integrating all types of Spark jobs.

    Toree focuses on interactive jobs. It's still in incubation, but even now you can check out its possibilities.

    Why use a custom, additional REST service when there is a built-in REST API? A Spark-as-a-Service application like Livy is a single entry point to Spark. It manages the Spark context and runs on a single node, which can sit outside the cluster. These services also enable interactive analytics. Apache Zeppelin uses Livy to submit the user's code to Spark.
