BigQuery connector ClassNotFoundException in PySpark on Dataproc

Problem description

I'm trying to run a script in PySpark, using Dataproc.

The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't.

The error I get is:

文件"/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",第328行,位于get_return_value中py4j.protocol.Py4JJavaError:调用z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD时发生错误.:java.lang.ClassNotFoundException:com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat

File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD. : java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat

I made sure I have all the jars and added some new ones as suggested in other similar posts. I also checked the SPARK_HOME variable.

Below you can see the code; the error appears when trying to instantiate table_data.

"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()

# Use the cluster's GCS staging bucket and project, and a temporary path for the connector's export.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

# Input configuration for the BigQuery Hadoop connector (reads publicdata:samples.shakespeare).
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Read the table as (key, JSON string) records; this is the call that raises ClassNotFoundException.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
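
For reference, once the connector class is available, each element of table_data is a (key, JSON string) pair. Here is a minimal sketch of consuming it, assuming the word and word_count columns of the public shakespeare sample table (those column names come from the sample dataset, not from this script); it reuses the json and pprint imports already at the top of the script:

word_counts = (
    table_data
    .map(lambda record: json.loads(record[1]))                # parse the JSON value
    .map(lambda row: (row['word'], int(row['word_count'])))   # (word, count) pairs
    .reduceByKey(lambda a, b: a + b))                         # sum counts per word

pprint.pprint(word_counts.take(10))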

Recommended answer

As pointed out in the example, you need to include the BigQuery connector jar when submitting the job.

Through the Dataproc jobs API:

gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
    /path/to/your/script.py \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar

Or from within the cluster, using spark-submit:

spark-submit --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
    /path/to/your/script.py
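
The --jars flag is what resolves the error: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat is packaged in bigquery-connector-hadoop2-latest.jar, so distributing that jar with the job puts the class on the driver and executor classpath and the ClassNotFoundException no longer occurs.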
