Firehose JSON -> S3 Parquet -> ETL Spark, error: Unable to infer schema for Parquet

Question

It seems like this should be easy, like it's a core use case of this set of features, but it's been problem after problem.

The latest is in trying to run commands via a Glue Dev endpoint (both the PySpark and Scala end-points).

Following the instructions here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
glueContext = GlueContext(SparkContext.getOrCreate())
# Reading the Firehose output via the Glue Data Catalog; this is the call that fails
df = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="mytable")

Produces this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/tmp/spark-0ba544c9-0b5a-42bb-8a54-9b3138205179/userFiles-95708c09-59f6-4b6d-9016-d0375d511b7a/PyGlue.zip/awsglue/dynamicframe.py", line 557, in from_catalog
  File "/mnt/tmp/spark-0ba544c9-0b5a-42bb-8a54-9b3138205179/userFiles-95708c09-59f6-4b6d-9016-d0375d511b7a/PyGlue.zip/awsglue/context.py", line 136, in create_dynamic_frame_from_catalog
  File "/mnt/tmp/spark-0ba544c9-0b5a-42bb-8a54-9b3138205179/userFiles-95708c09-59f6-4b6d-9016-d0375d511b7a/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

It also generates this warning, in one of the setup lines:

18/06/26 19:09:35 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

The overall setup is pretty simple: we have an incoming Kinesis data stream, a processor for that stream that produces a JSON Kinesis data stream, a Kinesis Firehose stream configured to write that JSON stream to Parquet files in S3, and then the required Glue catalogue configurations to make that happen.
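
For reference, the Firehose record-format conversion is the piece that turns the JSON stream into Parquet files. A minimal boto3 sketch of such a delivery stream; every name, ARN, and the bucket/prefix below is a placeholder, not taken from the original setup:

import boto3

firehose = boto3.client("firehose")

# Hypothetical sketch: a Kinesis-sourced Firehose stream that converts
# incoming JSON records to Parquet using a Glue catalogue table's schema.
firehose.create_delivery_stream(
    DeliveryStreamName="json-to-parquet",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-json-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::mybucket",
        "Prefix": "mytable/",
        # Format conversion requires a larger buffer (64 MB minimum).
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # The output schema comes from the Glue catalogue table.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
                "DatabaseName": "mydb",
                "TableName": "mytable",
                "Region": "us-east-1",
            },
        },
    },
)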

Athena can see the data just fine, but the Scala/PySpark scripts error out.
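
As an aside, the symptom can be reproduced with plain Spark as well; a minimal sketch, assuming the placeholder bucket/prefix used in the answer below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Firehose writes to dated prefixes (yyyy/MM/dd/HH), which are not Hive-style
# key=value partitions. Pointing plain Spark at the bare root can raise the
# same "Unable to infer schema for Parquet" AnalysisException, while globbing
# down to the leaf folders reads the files:
df = spark.read.parquet("s3://mybucket/mytable/*/*/*/*/")
df.printSchema()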

Any thoughts/suggestions?

Answer

Okay, still not clear why this was happening, but, found a fix!

Basically, instead of using the generated code:

val datasource0 = glueContext.getCatalogSource(
        database = "db",
        tableName = "myTable",
        redshiftTmpDir = "",
        transformationContext = "datasource0"
    ).getDynamicFrame()

use this code:

val crawledData = glueContext.getSourceWithFormat(
        connectionType = "s3",
        options = JsonOptions(s"""{"path": "s3://mybucket/mytable/*/*/*/*/"}"""),
        format = "parquet",
        transformationContext = "source"
    ).getDynamicFrame()

The key bit here seemed to be the */*/*/*/ - if I just specified the root folder, I would get the Parquet error, and (apparently) the normal /**/* wildcard wouldn't work.
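
For anyone on the PySpark endpoint, an untested sketch of the same workaround via create_dynamic_frame.from_options; it assumes globs work the same way there as in the Scala getSourceWithFormat call, and reuses the placeholder bucket path from above:

# Hypothetical PySpark equivalent of the Scala fix: bypass the catalogue and
# glob down to the hour-level Firehose folders.
crawled_data = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://mybucket/mytable/*/*/*/*/"]},
    format="parquet",
    transformation_ctx="source",
)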
