BigQuery connector for Spark on Dataproc - cannot authenticate using service account key file


Question


I have followed Use the BigQuery connector with Spark to successfully get data from a publicly available dataset. I now need to access a BigQuery dataset that is owned by one of our clients and for which I have been given a service account key file (I know that the service account key file is valid because I can use it to connect using the Google BigQuery library for Python).
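For reference, here is roughly how that verification can be done with the Python client library (a sketch, assuming a recent google-cloud-bigquery; the project, dataset and table names are the same placeholders used throughout this question):

from google.cloud import bigquery

# Build a client directly from the service account key file.
client = bigquery.Client.from_service_account_json(
    '/tmp/keyfile.json', project='clientproject')

# Fetching table metadata exercises bigquery.tables.get -- the same
# permission the Spark job fails on below.
table = client.get_table('clientproject.clientdataset.clienttable')
print('table has {} rows'.format(table.num_rows))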


I have followed what Igor Dvorzhak recommended here:


To use service account key file authorization you need to set mapred.bq.auth.service.account.enable property to true and point BigQuery connector to a service account json keyfile using mapred.bq.auth.service.account.json.keyfile property

Like this:

from pyspark.sql import SparkSession
from datetime import datetime

spark = SparkSession.builder.appName("SparkSessionBQExample").enableHiveSupport().getOrCreate()

bucket = spark._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = spark._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input{}'.format(bucket, datetime.now().strftime("%Y%m%d%H%M%S"))

project_id = 'clientproject'  # 'publicdata'
dataset_id = 'clientdataset'  # 'samples'
table_id = 'clienttable'  # 'shakespeare'
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': project_id,
    'mapred.bq.input.dataset.id': dataset_id,
    'mapred.bq.input.table.id': table_id,
    'mapred.bq.auth.service.account.enable': 'true'
}

# Load data in from BigQuery.
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

print('row tally={}'.format(table_data.toDF().count()))


I have placed the service account key file at /tmp/keyfile.json on the master node and all the worker nodes of the cluster. I then submit my job like so:

gcloud dataproc jobs submit pyspark \
    ./bq_pyspark.py  \
    --cluster $CLUSTER \
    --region $REGION \
    --properties=spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json

I have also tried:

gcloud dataproc jobs submit pyspark \
    ./bq_pyspark.py  \
    --cluster $CLUSTER \
    --region $REGION \
    --properties=spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json,spark.hadoop.mapred.bq.auth.service.account.enable=true


Here are the pertinent sections of the job output:


Bigquery connector version 0.10.7-hadoop2
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from given credential.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration: Using working path: 'gs://dataproc-9e5dc592-1a35-42e6-9dd6-5f9dd9c8df87-europe-west1/hadoop/tmp/bigquery/pyspark_input20181107133646'
Traceback (most recent call last):
  File "/tmp/b6973a26c76d4069a86806dfbd2d7d0f/bq_pyspark.py", line 30, in <module>
    conf=conf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 702, in newAPIHadoopRDD
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "Access Denied: Table clientproject:clientdatatset.clienttable: The user mydataprocserviceaccount@myproject.iam.gserviceaccount.com does not have bigquery.tables.get permission for table clientproject:clientdatatset.clienttable.",
"reason" : "accessDenied"
} ],
"message" : "Access Denied: Table clientproject:clientdatatset.clienttable: The user mydataprocserviceaccount@myproject.iam.gserviceaccount.com does not have bigquery.tables.get permission for table clientproject:clientdatatset.clienttable."
}


18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.


possibly suggests that I'm not passing the credentials from the service account key file correctly, so I guess I've misunderstood what Igor said (or some info is missing).
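One way to sanity-check that (a sketch, reusing the hadoopConfiguration accessor from the script above) is to print the properties as the driver actually sees them; None would mean the --properties values were never applied:

# Sanity check (sketch): did the keyfile properties reach the Hadoop
# configuration inside the job?
hadoop_conf = spark._jsc.hadoopConfiguration()
print(hadoop_conf.get('mapred.bq.auth.service.account.enable'))
print(hadoop_conf.get('mapred.bq.auth.service.account.json.keyfile'))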


If anyone can let me know where I'm going wrong I'd very much appreciate it.


UPDATE... I have attempted to supply the required auth configuration via code instead of via the command line:

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': project_id,
    'mapred.bq.input.dataset.id': dataset_id,
    'mapred.bq.input.table.id': table_id,
    'mapred.bq.auth.service.account.enable': 'true',
    'mapred.bq.auth.service.account.keyfile': '/tmp/keyfile.json',
    'mapred.bq.auth.service.account.email': 'username@clientproject.iam.gserviceaccount.com'
}


This time I got a different error:


18/11/07 16:44:21 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
Traceback (most recent call last):
  File "/tmp/cb5cbb16d59945dd926cab2c1f2f5524/bq_pyspark.py", line 39, in <module>
    conf=conf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 702, in newAPIHadoopRDD
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: toDerInputStream rejects tag type 123
at sun.security.util.DerValue.toDerInputStream(DerValue.java:881)
at sun.security.pkcs12.PKCS12KeyStore.engineLoad(PKCS12KeyStore.java:1939)
at java.security.KeyStore.load(KeyStore.java:1445)
at com.google.api.client.util.SecurityUtils.loadKeyStore(SecurityUtils.java:82)
at com.google.api.client.util.SecurityUtils.loadPrivateKeyFromKeyStore(SecurityUtils.java:115)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:251)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:100)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:95)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:115)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:103)


I googled "toDerInputStream rejects tag type 123", which led me to toDerInputStream rejects tag type 123, which suggests I need to authenticate using a P12 file. This is consistent with the mention of sun.security.pkcs12.PKCS12KeyStore in the call stack. Hence, I think I need a P12 file (aka a PKCS#12 format file) rather than a .json file, which means I need to go back to the client to ask for that - and from experience I think it may take some time to get the P12 file. I'll report back if/when I get anywhere.
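For what it's worth, the stack trace is consistent with the connector supporting two keyfile flavours, selected purely by property name (the JSON property name is confirmed in UPDATE 2 below; the .p12 path here is hypothetical):

# P12 (PKCS#12) keyfile -- this is the code path the stack trace above hit;
# it also requires the service account email.
conf_p12 = {
    'mapred.bq.auth.service.account.enable': 'true',
    'mapred.bq.auth.service.account.keyfile': '/tmp/keyfile.p12',
    'mapred.bq.auth.service.account.email': 'username@clientproject.iam.gserviceaccount.com',
}

# JSON keyfile -- note the extra '.json' in the property name.
conf_json = {
    'mapred.bq.auth.service.account.enable': 'true',
    'mapred.bq.auth.service.account.json.keyfile': '/tmp/keyfile.json',
}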


UPDATE 2... figured it out, with Igor's help. I was wrongly specifying mapred.bq.auth.service.account.keyfile when it needed to be mapred.bq.auth.service.account.json.keyfile. Thus the pertinent section of code becomes:

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': project_id,
    'mapred.bq.input.dataset.id': dataset_id,
    'mapred.bq.input.table.id': table_id,
    'mapred.bq.auth.service.account.enable': 'true',
    'mapred.bq.auth.service.account.json.keyfile': '/tmp/keyfile.json'
}
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

The submit command is simply:

gcloud dataproc jobs submit pyspark \
    ./bq_pyspark.py  \
    --cluster $CLUSTER \
    --region $REGION


It now works: I am able to access data in BigQuery from Spark on Dataproc, authenticating using a service account JSON key file. Thank you, Igor.

Recommended answer

The problem seems to be here:


Warning: Ignoring non-spark config property: mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json


To fix this, you should set Hadoop properties with the spark.hadoop prefix in Spark:

gcloud dataproc jobs submit pyspark ./bq_pyspark.py \
  --cluster $CLUSTER --region $REGION \
  --properties=spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json
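Spark strips the spark.hadoop. prefix from such properties and copies the remainder into the job's Hadoop Configuration. Equivalently (a sketch), the same properties can be set from inside the script before the RDD is created:

# Equivalent in-code configuration (sketch): setting the Hadoop property
# directly has the same effect as passing it with a 'spark.hadoop.' prefix
# on the command line.
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set('mapred.bq.auth.service.account.enable', 'true')
hadoop_conf.set('mapred.bq.auth.service.account.json.keyfile', '/tmp/keyfile.json')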
