How can I get spark on emr-5.2.1 to write to dynamodb?

Problem description

According to this article, when I create an AWS EMR cluster that will use Spark to pipe data to DynamoDB, I need to start spark-shell with the line:

spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar

This line appears in numerous references, including from the Amazon devs themselves. However, when I run create-cluster with an added --jars flag, I get this error:

Exception in thread "main" java.io.FileNotFoundException: File file:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:616)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:829)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:431)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
...

There's an answer at this SO question saying that the library should already be included in emr-5.2.1, so I tried running my code without that extra --jars flag:

ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/hadoop/dynamodb/DynamoDBItemWritable
java.lang.NoClassDefFoundError: org/apache/hadoop/dynamodb/DynamoDBItemWritable
at CopyS3ToDynamoApp$.main(CopyS3ToDynamo.scala:113)
at CopyS3ToDynamoApp.main(CopyS3ToDynamo.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.dynamodb.DynamoDBItemWritable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
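
(For context, the missing class ships in the EMR DynamoDB connector jar. A connector-based write typically looks something like the sketch below; the table name, attribute names, and app skeleton are placeholder assumptions, not the actual CopyS3ToDynamo code.)

import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object DynamoWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DynamoWriteSketch").getOrCreate()
    val sc = spark.sparkContext

    // Point the Hadoop output format at the target table.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.output.tableName", "my-table") // placeholder
    jobConf.set("mapred.output.format.class",
      classOf[DynamoDBOutputFormat].getName)

    // Wrap each record in a DynamoDBItemWritable; the Text key is ignored.
    val items = sc.parallelize(Seq("a" -> 1, "b" -> 2)).map { case (k, v) =>
      val attrs = Map(
        "id"    -> new AttributeValue().withS(k),
        "count" -> new AttributeValue().withN(v.toString)
      ).asJava
      val item = new DynamoDBItemWritable()
      item.setItem(attrs)
      (new Text(""), item)
    }

    // This call needs DynamoDBOutputFormat (and friends) on the classpath.
    items.saveAsHadoopDataset(jobConf)
    spark.stop()
  }
}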

Just for grins, I tried the alternative proposed by the other answer to that question, adding --driver-class-path,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, to my step, and was told:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2715)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)

Not being able to find s3a.S3AFileSystem seems like a big one, especially since I have other jobs that read from s3 just fine; apparently reading from s3 and writing to dynamo is tricky. Any ideas on how to solve this problem?

Update: I figured that s3 wasn't being found because I was overriding the classpath and dropping all the other libraries, so I updated the classpath like so:

# Driver classpath for the step: EMR's stock libraries (hadoop-lzo, hadoop-aws,
# the AWS SDK, EMRFS, security) plus the DynamoDB connector directory.
class_path = "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:" \
             "/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:" \
             "/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:" \
             "/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:" \
             "/usr/share/aws/emr/ddb/lib/*"

Now I get this error:

 diagnostics: User class threw exception: java.lang.NoClassDefFoundError: org/apache/hadoop/dynamodb/DynamoDBItemWritable
 ApplicationMaster host: 10.178.146.133
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1484852731196
 final status: FAILED
 tracking URL: http://ip-10-178-146-68.syseng.tmcs:20888/proxy/application_1484852606881_0001/

So it looks like the library isn't in the location specified by the AWS documentation. Has anyone gotten this to work?

Recommended answer

OK, figuring this out took me days, so I'll spare the next person who comes along with this question.

The reason that these methods fail is that the path specified by the AWS folks does not exist on emr 5.2.1 clusters (and maybe not on any emr 5.0 cluster at all).

So I downloaded the emr-dynamodb-hadoop jar from Maven.

Because the jar is not on the emr cluster, you're going to need to include it in your jar. If you're using sbt, you can use sbt assembly. If you don't want to have such a monolithic jar going on (and have to figure out the conflict resolution between version 1.7 and 1.8 of netbeans), you can also just merge jars as part of your build process. This way, you have one jar for your emr step that you can put on s3 for easy create-cluster based on-demand spark jobs.
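
In case it saves the next person some digging: with sbt, the build definition might look like the sketch below. It assumes the sbt-assembly plugin is enabled in project/plugins.sbt, and the connector version is my guess; check Maven Central for the release that matches your EMR version.

// build.sbt sketch: bundle the connector from Maven Central into the fat jar.
// com.amazon.emr:emr-dynamodb-hadoop are the published coordinates; 4.2.0 is
// an assumed version.
libraryDependencies += "com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2.0"

// Spark itself ships on the cluster (2.0.2 on emr-5.2.1), so mark it provided
// to keep it out of the assembly.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" % "provided"

// Resolve duplicate files pulled in by overlapping transitive jars.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}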
