How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

Problem description

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.

The aws-java-sdk-glue artifact doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere, but perhaps they are just a Java/Scala port of this library: aws-glue-libs
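
For reference, the dependency I had been trying is the Glue module of the AWS Java SDK, which (as far as I can tell) only covers the Glue control-plane API (creating jobs, starting runs, and so on), not the ETL runtime classes imported in the template below; the version here is just illustrative:

libraryDependencies += "com.amazonaws" % "aws-java-sdk-glue" % "1.11.286" // control-plane API only; no GlueContext, DynamicFrame, etc.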

The template Scala code from AWS:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}

And the build.sbt I have started putting together for a local build:

name := "aws-glue-scala"

version := "0.1"

scalaVersion := "2.11.12"

updateOptions := updateOptions.value.withCachedResolution(true)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"

The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.

Recommended answer

This is now supported: AWS recently released libraries for developing and testing Glue ETL scripts locally. See:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
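
That page publishes the Glue ETL runtime classes (GlueContext, GlueArgParser, Job, and the DynamicFrame API) to an AWS-hosted Maven repository, so a local sbt build can resolve them directly. A minimal sketch of the additions to a build.sbt like the one above, assuming the Glue 1.0 coordinates listed on that page (pick the version matching your Glue environment):

resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"

// Glue supplies these classes at job runtime, so keep them out of the deployed artifact
libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "1.0.0" % Provided

With that resolving, the GlueApp template compiles locally, and the jar uploaded for the Glue job only needs your own classes, since Spark and the Glue libraries are provided by the job's runtime (hence the Provided scope).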
