How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

Question

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.

The aws-java-sdk-glue artifact doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere; perhaps they are just a Java/Scala port of this library: aws-glue-libs

The template Scala code from AWS:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}

And the build.sbt I have started putting together for a local build:

name := "aws-glue-scala"

version := "0.1"

scalaVersion := "2.11.12"

updateOptions := updateOptions.value.withCachedResolution(true)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"

The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.

Recommended answer

This is now supported, as of a recent release from AWS:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
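
The linked page describes pulling the Glue ETL library from an AWS-hosted Maven repository. The documentation gives a Maven pom.xml; as a rough sbt sketch of the same idea, the additions to the build.sbt from the question might look like the following. The version number is an assumption and should be replaced with the one matching the Glue version you target, and the name glueVersion is only illustrative.

```scala
// Additions to build.sbt: a sketch under stated assumptions,
// not an official AWS configuration.

// Maven repository where AWS publishes the Glue ETL artifacts.
resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"

// Assumption: "1.0.0" stands in for the Glue version you target;
// check the linked documentation for the correct value.
val glueVersion = "1.0.0"

// Supplies the com.amazonaws.services.glue.* classes imported by GlueApp
// (GlueContext, Job, GlueArgParser, JsonOptions, ...).
// "Provided" keeps the library out of the packaged jar, on the assumption
// that the Glue runtime supplies it when the job runs.
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % glueVersion % Provided
```

Note the single % before the artifact name: the artifact ID has no Scala-version suffix, so %% would look for a name that does not exist. The Spark version in the existing spark-core line may also need to be aligned with whatever Spark release the chosen Glue version runs.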
