Unable to run scripts properly in AWS Glue PySpark Dev Endpoint

Problem description

I've configured an AWS Glue dev endpoint and can connect to it successfully in a pyspark REPL shell, as described here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html

Unlike the example given in the AWS documentation, I receive WARNings when I begin the session, and later on various operations on AWS Glue DynamicFrame structures fail. Here's the full log from starting the session - note the warnings about spark.yarn.jars and PyGlue.zip:

Python 2.7.12 (default, Sep  1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/03/02 14:18:58 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/03/02 14:19:03 WARN Client: Same path resource file:/usr/share/aws/glue/etl/python/PyGlue.zip added multiple times to distributed cache.
18/03/02 14:19:13 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.12 (default, Sep  1 2016 22:14:00)
SparkSession available as 'spark'.
>>>

Many operations work as I expect, but I also receive some unwelcome exceptions. For example, I can load data from my Glue catalog and inspect its structure and the data within, but I can't apply a Map to it or convert it to a DF. Here's my full execution log (apart from the longest error message). The first few commands and setup all work well, but the final two operations fail:

>>> import sys
>>> from awsglue.transforms import *
>>> from awsglue.utils import getResolvedOptions
>>> from pyspark.context import SparkContext
>>> from awsglue.context import GlueContext
>>> from awsglue.job import Job
>>>
>>> glueContext = GlueContext(spark)
>>> # Receives a string of the format yyyy-mm-dd hh:mi:ss.nnn and returns the first 10 characters: yyyy-mm-dd
... def TruncateTimestampString(ts):
...   ts = ts[:10]
...   return ts
...
>>> TruncateTimestampString('2017-03-05 06:12:08.376')
'2017-03-05'
>>>
>>> # Given a record with a timestamp property returns a record with a new property, day, containing just the date portion of the timestamp string, expected to be yyyy-mm-dd.
... def TruncateTimestamp(rec):
...   rec['day'] = TruncateTimestampString(rec['timestamp'])
...   return rec
...
>>> # Get the history datasource - WORKS WELL BUT LOGS log4j2 ERROR
>>> datasource_history_1 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "history", transformation_ctx = "datasource_history_1")
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
>>> # Tidy the history datasource - WORKS WELL
>>> history_tidied = datasource_history_1.drop_fields(['etag', 'jobmaxid', 'jobminid', 'filename']).rename_field('id', 'history_id')
>>> history_tidied.printSchema()
root
|-- jobid: string
|-- spiderid: long
|-- timestamp: string
|-- history_id: long

>>> # Trivial observation of the SparkSession objects
>>> SparkSession
<class 'pyspark.sql.session.SparkSession'>
>>> spark
<pyspark.sql.session.SparkSession object at 0x7f8668f3b650>
>>> 
>>> 
>>> # Apply a mapping to the tidied history datasource. FAILS
>>> history_mapped = history_tidied.map(TruncateTimestamp)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/tmp/spark-1f0341db-5de6-4008-974f-a1d194524a86/userFiles-6a67bdee-7c44-46d6-a0dc-9daa7177e7e2/PyGlue.zip/awsglue/dynamicframe.py", line 101, in map
File "/mnt/tmp/spark-1f0341db-5de6-4008-974f-a1d194524a86/userFiles-6a67bdee-7c44-46d6-a0dc-9daa7177e7e2/PyGlue.zip/awsglue/dynamicframe.py", line 105, in mapPartitionsWithIndex
File "/usr/lib/spark/python/pyspark/rdd.py", line 2419, in __init__
    self._jrdd_deserializer = self.ctx.serializer
AttributeError: 'SparkSession' object has no attribute 'serializer'
>>> history_tidied.toDF()
ERROR
Huge error log and stack trace follows, longer than my console can remember. Here's how it finishes:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/tmp/spark-1f0341db-5de6-4008-974f-a1d194524a86/userFiles-6a67bdee-7c44-46d6-a0dc-9daa7177e7e2/PyGlue.zip/awsglue/dynamicframe.py", line 128, in toDF
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

I think I'm following the instructions Amazon gives in their Dev Endpoint REPL tutorial, but with these fairly simple operations (DynamicFrame.map and DynamicFrame.toDF) failing, I'm working in the dark when I want to run the job for real (which seems to succeed, but my DynamicFrame.printSchema() and DynamicFrame.show() commands don't show up in the CloudWatch logs for the execution).
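
As a side note, one approach I'd expect (but haven't confirmed) for getting that kind of diagnostic output into the CloudWatch driver logs is to route it through the Glue logger instead of relying on stdout; this assumes glueContext.get_logger() is available in the job's awsglue library and, of course, that the operations themselves succeed. A rough sketch:

# Rough sketch, not from the original post: send diagnostics through the Glue
# logger so they appear in the CloudWatch driver log stream for the job run.
logger = glueContext.get_logger()
logger.info("history schema: " + str(datasource_history_1.schema()))    # DynamicFrame schema
logger.info("history row count: " + str(datasource_history_1.count()))  # forces evaluation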

Does anyone know what I need to do to fix my REPL environment so that I can properly test pyspark AWS Glue scripts?

Recommended answer

AWS Support finally responded to my query on this issue. Here's the response:

On researching further, I found that this is a known issue with the PySpark shell, and the Glue service team is already working on it. The fix should be deployed soon; however, there's currently no ETA that I can share with you.

Meanwhile, here's a workaround: before initializing the Glue context, you can do

>> newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
>> sc.stop()
>> sc = sc.getOrCreate(newconf)

and then instantiate glueContext from that sc.

I can confirm this works for me; here's a script I was able to run:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# New recommendation from AWS Support 2018-03-22
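# ('sc' here is assumed to be the SparkContext that the pyspark REPL shell already provides)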
newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
sc.stop()
sc = sc.getOrCreate(newconf)
# End AWS Support Workaround

glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

datasource_history_1 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "history", transformation_ctx = "datasource_history_1")

def DoNothingMap(rec):
  return rec

history_mapped = datasource_history_1.map(DoNothingMap)
history_df = history_mapped.toDF()
history_df.show()
history_df.printSchema()

Previously, the .map() and .toDF() calls failed.
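
For illustration (this part is not from the AWS Support response), here's a rough sketch of re-running the question's original map with the workaround in place; it assumes the 'timestamp' field from the table schema and adds a new 'day' field:

# Hedged sketch: re-apply the original timestamp-truncating map now that the
# workaround is active. 'timestamp' comes from the history table schema.
def TruncateTimestamp(rec):
  # Keep only the yyyy-mm-dd prefix of the timestamp string.
  rec['day'] = rec['timestamp'][:10]
  return rec

history_mapped2 = datasource_history_1.map(TruncateTimestamp)
history_mapped2.toDF().select('day').show(5)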

I've asked AWS Support to notify me when this issue has been resolved so that the workaround is no longer required.
