Dataproc 上 Spark 的 BigQuery 连接器 - 无法使用服务帐户密钥文件进行身份验证 [英] BigQuery connector for Spark on Dataproc - cannot authenticate using service account key file
问题描述
我已遵循 将 BigQuery 连接器与 Spark 结合使用 从公开可用的数据集中成功获取数据.我现在需要访问一个由我们的一个客户拥有的 bigquery 数据集,我已经为其提供了一个服务帐户密钥文件(我知道服务帐户密钥文件是有效的,因为我可以使用它来使用
I have followed Use the BigQuery connector with Spark to successfully get data from a publicly available dataset. I now need to access a bigquery dataset that is owned by one of our clients and for which I have been given a service account key file (I know that the service account key file is valid because I can use it to connect using the Google BigQuery library for Python).
我遵循了 Igor Dvorzhak 推荐的这里
I have followed what Igor Dvorzhak recommended here
要使用服务帐户密钥文件授权,您需要将 mapred.bq.auth.service.account.enable
属性设置为 true 并使用 mapred 将 BigQuery 连接器指向服务帐户 json 密钥文件.bq.auth.service.account.json.keyfile
属性
To use service account key file authorization you need to set
mapred.bq.auth.service.account.enable
property to true and point BigQuery connector to a service account json keyfile usingmapred.bq.auth.service.account.json.keyfile
property
像这样:
from pyspark.sql import SparkSession
from datetime import datetime
spark = SparkSession.builder.appName("SparkSessionBQExample").enableHiveSupport().getOrCreate()
bucket = spark._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = spark._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input{}'.format(bucket, datetime.now().strftime("%Y%m%d%H%M%S"))
project_id = 'clientproject'#'publicdata'
dataset_id = 'clientdataset'#samples'
table_id = 'clienttable'#'shakespeare'
conf = {
# Input Parameters.
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': project_id,
'mapred.bq.input.dataset.id': dataset_id,
'mapred.bq.input.table.id': table_id,
'mapred.bq.auth.service.account.enable': 'true'
}
# Load data in from BigQuery.
table_data = spark.sparkContext.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
print ('row tally={}'.format(table_data.toDF().count()))
我已将服务帐户密钥文件放在主节点和集群的所有工作节点上的 /tmp/keyfile.json
然后我像这样提交我的工作:
I have placed the service account key file at /tmp/keyfile.json
on the master node and all the worker nodes of the cluster then I submit my job like so:
gcloud dataproc jobs submit pyspark
./bq_pyspark.py
--cluster $CLUSTER
--region $REGION
--properties=spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json
我也试过:
gcloud dataproc jobs submit pyspark
./bq_pyspark.py
--cluster $CLUSTER
--region $REGION
--properties=spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json,spark.hadoop.mapred.bq.auth.service.account.enable=true
以下是作业输出的相关部分:
Here are the pertinent sections of the job output:
Bigquery 连接器版本 0.10.7-hadoop2
18/11/07 13:36:47 信息 com.google.cloud.hadoop.io.bigquery.BigQueryFactory:从默认凭据创建 BigQuery.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory:根据给定的凭据创建 BigQuery.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration:使用工作路径:'gs://dataproc-9e5dc592-1a35-42e6-9dd6-5f9dd9c8df87-europe-west1/hadoop/tmp/bigquery/pyspark_input20181107133646'
回溯(最近一次调用最后一次):
文件/tmp/b6973a26c76d4069a86806dfbd2d7d0f/bq_pyspark.py",第 30 行,在
conf=conf)
文件/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py",第 702 行,在 newAPIHadoopRDD
文件/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",第 1133 行,调用
文件/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",第 63 行,在 deco
文件/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",第 319 行,在 get_return_value
py4j.protocol.Py4JJavaError:调用 z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD 时发生错误.
: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
代码":403,
错误":[ {
"域": "全局",
"message" : "访问被拒绝:表 clientproject:clientdatatset.clienttable: 用户 mydataprocserviceaccount@myproject.iam.gserviceaccount.com 没有表 clientproject:clientdatatset.clienttable 的 bigquery.tables.get 权限.",
原因":访问被拒绝"
}],
"message" : "访问被拒绝:表 clientproject:clientdatatset.clienttable: 用户 mydataprocserviceaccount@myproject.iam.gserviceaccount.com 没有表 clientproject:clientdatatset.clienttable 的 bigquery.tables.get 权限."
}
Bigquery connector version 0.10.7-hadoop2
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from given credential.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration: Using working path: 'gs://dataproc-9e5dc592-1a35-42e6-9dd6-5f9dd9c8df87-europe-west1/hadoop/tmp/bigquery/pyspark_input20181107133646'
Traceback (most recent call last):
File "/tmp/b6973a26c76d4069a86806dfbd2d7d0f/bq_pyspark.py", line 30, in
conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 702, in newAPIHadoopRDD
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "Access Denied: Table clientproject:clientdatatset.clienttable: The user mydataprocserviceaccount@myproject.iam.gserviceaccount.com does not have bigquery.tables.get permission for table clientproject:clientdatatset.clienttable.",
"reason" : "accessDenied"
} ],
"message" : "Access Denied: Table clientproject:clientdatatset.clienttable: The user mydataprocserviceaccount@myproject.iam.gserviceaccount.com does not have bigquery.tables.get permission for table clientproject:clientdatatset.clienttable."
}
线
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory:从默认凭据创建 BigQuery.
18/11/07 13:36:47 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
可能表明我没有正确传递服务帐户密钥文件中的凭据,所以我想我误解了 Igor 所说的内容(或缺少某些信息).
possibly suggests that I'm not passing the credentials from the service account key file correctly so I guess I've misunderstood what Igor said (or some info is missing).
如果有人能告诉我哪里出错了,我将不胜感激.
If anyone can let me know where I'm going wrong I'd very much appreciate it.
更新...我试图通过代码而不是通过命令行提供所需的身份验证配置:
UPDATE... I have attempted to supply the required auth configuration via code instead of via the command-line:
conf = {
# Input Parameters.
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': project_id,
'mapred.bq.input.dataset.id': dataset_id,
'mapred.bq.input.table.id': table_id,
'mapred.bq.auth.service.account.enable': 'true',
'mapred.bq.auth.service.account.keyfile': '/tmp/keyfile.json',
'mapred.bq.auth.service.account.email': 'username@clientproject.iam.gserviceaccount.com'
}
这次我遇到了不同的错误:
This time I got a different error:
18/11/07 16:44:21 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory:从默认凭据创建 BigQuery.
回溯(最近一次调用最后一次):
文件/tmp/cb5cbb16d59945dd926cab2c1f2f5524/bq_pyspark.py",第 39 行,在
conf=conf)
文件/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py",第 702 行,在 newAPIHadoopRDD
文件/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",第 1133 行,调用
文件/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",第 63 行,在 deco
文件/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",第 319 行,在 get_return_value 中py4j.protocol.Py4JJavaError:调用 z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD 时发生错误.
: java.io.IOException: toDerInputStream 拒绝标签类型 123
在 sun.security.util.DerValue.toDerInputStream(DerValue.java:881)
在 sun.security.pkcs12.PKCS12KeyStore.engineLoad(PKCS12KeyStore.java:1939)
在 java.security.KeyStore.load(KeyStore.java:1445)
在 com.google.api.client.util.SecurityUtils.loadKeyStore(SecurityUtils.java:82)
在 com.google.api.client.util.SecurityUtils.loadPrivateKeyFromKeyStore(SecurityUtils.java:115)
在 com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
在 com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:251)
在 com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:100)
在 com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:95)
在 com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:115)
在 com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:103)
18/11/07 16:44:21 INFO com.google.cloud.hadoop.io.bigquery.BigQueryFactory: Creating BigQuery from default credential.
Traceback (most recent call last):
File "/tmp/cb5cbb16d59945dd926cab2c1f2f5524/bq_pyspark.py", line 39, in
conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 702, in newAPIHadoopRDD
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: toDerInputStream rejects tag type 123
at sun.security.util.DerValue.toDerInputStream(DerValue.java:881)
at sun.security.pkcs12.PKCS12KeyStore.engineLoad(PKCS12KeyStore.java:1939)
at java.security.KeyStore.load(KeyStore.java:1445)
at com.google.api.client.util.SecurityUtils.loadKeyStore(SecurityUtils.java:82)
at com.google.api.client.util.SecurityUtils.loadPrivateKeyFromKeyStore(SecurityUtils.java:115)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:251)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:100)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:95)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:115)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:103)
我在谷歌上搜索toDerInputStream 拒绝标签类型 123",这导致我toDerInputStream 拒绝标签类型 123 这表明我需要使用 P12 文件进行身份验证.这与调用堆栈中提到的 sun.security.pkcs12.PKCS12KeyStore
一致.因此,我认为我需要一个 P12 文件(又名 PKCS#12 格式文件)而不是一个 .json 文件,这意味着我需要回到客户端来要求 - 从经验来看,我认为这可能需要一些时间获取 P12 文件.如果/当我到达任何地方时,我会回来报告.
I googled "toDerInputStream rejects tag type 123" which led me to toDerInputStream rejects tag type 123 which suggests I need to authenticate using a P12 file. This is consistent with the mention of sun.security.pkcs12.PKCS12KeyStore
in the call stack. Hence, I think I need a P12 file (aka a PKCS#12 format file) rather than a .json file which means I need to go back to the client to ask for that - and from experience I think it may take some time to get the P12 file. I'll report back if/when i get anywhere.
更新 2...在 Igor 的帮助下弄清楚了.我错误地指定了 mapred.bq.auth.service.account.keyfile
,它需要是 mapred.bq.auth.service.account.json.keyfile
.因此,相关的代码部分变为:
UPDATE 2... figured it out, with Igor's help. I was wrongly specifying mapred.bq.auth.service.account.keyfile
, it needed to be mapred.bq.auth.service.account.json.keyfile
. Thus the pertinent section of code becomes:
conf = {
# Input Parameters.
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': project_id,
'mapred.bq.input.dataset.id': dataset_id,
'mapred.bq.input.table.id': table_id,
'mapred.bq.auth.service.account.enable': 'true',
'mapred.bq.auth.service.account.json.keyfile': '/tmp/keyfile.json'
}
table_data = spark.sparkContext.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
提交命令很简单
gcloud dataproc jobs submit pyspark
./bq_pyspark.py
--cluster $CLUSTER
--region $REGION
它现在可以工作了,我可以从 spark-on-dataproc 访问 biquery 中的数据,并使用服务帐户 json 密钥文件进行身份验证.谢谢伊戈尔.
It now works, I am able to access data in biquery from spark-on-dataproc, authenticating using a service account json key file. Thank you Igor.
推荐答案
问题似乎在这里:
警告:忽略非火花配置属性:mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json
Warning: Ignoring non-spark config property: mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json
要解决此问题,您应该设置Spark 中带有 spark.hadoop
前缀的 Hadoop 属性:
To fix this, you should set Hadoop properties with spark.hadoop
prefix in Spark:
gcloud dataproc jobs submit pyspark ./bq_pyspark.py
--cluster $CLUSTER --region $REGION
--properties=spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/tmp/keyfile.json
这篇关于Dataproc 上 Spark 的 BigQuery 连接器 - 无法使用服务帐户密钥文件进行身份验证的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!