Error connecting to BigQuery from Dataproc with Datalab using BigQuery Spark connector (Error getting access token from metadata server at)


Problem description

I have a BigQuery table, a Dataproc cluster (with Datalab), and I follow this guide: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

from datetime import datetime

bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
project = spark._jsc.hadoopConfiguration().get("fs.gs.project.id")

# Set an input directory for reading data from BigQuery.
todays_date = datetime.strftime(datetime.today(), "%Y-%m-%d-%H-%M-%S")
input_directory = "gs://{}/tmp/bigquery-{}".format(bucket, todays_date)

# Set the configuration for importing data from BigQuery.
# Specifically, make sure to set the project ID and bucket for Cloud Dataproc,
# and the project ID, dataset, and table names for BigQuery.

conf = {
    # Input Parameters
    "mapred.bq.project.id": project,
    "mapred.bq.gcs.bucket": bucket,
    "mapred.bq.temp.gcs.path": input_directory,
    "mapred.bq.input.project.id": project,
    "mapred.bq.input.dataset.id": "my-test-dataset",
    "mapred.bq.input.table.id": "test-table"
}

# Read the data from BigQuery into Spark as an RDD.
table_data = spark.sparkContext.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)
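
As a quick sanity check (a hypothetical follow-up, not part of the original snippet), the first few records can be pulled back to the driver and printed:

# Hypothetical follow-up: materialize a few records from the RDD and print them.
import pprint
pprint.pprint(table_data.take(3))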

The script works fine when I connect to public datasets. However, when I try to connect to my private dataset, I receive the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:210)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:75)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:82)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:102)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:90)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getBigQueryHelper(AbstractBigQueryInputFormat.java:357)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:108)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
    at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:203)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:587)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: metadata
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:159)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:208)
    ... 35 more

Some additional information:

  1. I am using Python (PySpark) through Datalab (installed via https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/datalab)
  2. The BigQuery data is located in the US, while the Dataproc cluster is in the EU
  3. The Dataproc image is the latest one (1.2)
  4. The Dataproc cluster was configured with Google-wide API access

Recommended answer

As per the error message you are receiving (Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token [...] Caused by: java.net.UnknownHostException: metadata), it looks like the problem is that the service account is not able to retrieve the access token correctly.
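
Since the root cause is java.net.UnknownHostException: metadata, one quick check (a minimal sketch of my own, in Python 3 syntax, assuming you run it on the node where the Spark driver executes; metadata.google.internal is the fully qualified name of the same endpoint) is to probe the metadata server directly:

# Sketch: probe the GCE metadata server that the connector uses for credentials.
# The "Metadata-Flavor: Google" header is required by the metadata server.
import urllib.request

req = urllib.request.Request(
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
    headers={"Metadata-Flavor": "Google"})
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("Metadata server reachable, HTTP status:", resp.status)
except Exception as exc:
    print("Metadata server NOT reachable:", exc)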

In order to simplify your scenario, I would suggest that you start by narrowing down the products involved (because the failure can happen at different steps). To do so, run your PySpark code directly on the Dataproc cluster you already have running, as explained in the documentation:

  1. Go to the Dataproc > Clusters menu in the GCP Console.
  2. Open the cluster you are using, then go to the "VM Instances" tab.
  3. SSH into the master node by clicking the "SSH" button next to its name.
  4. Create a script words.py containing the PySpark code you want to run (a minimal sketch follows this list).
  5. Run the script with the command spark-submit words.py.
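
A minimal words.py for step 4 could look like the sketch below (an assumption-laden sketch, not taken from the original answer: it reuses the connector configuration from the question, keeps the placeholder dataset/table IDs my-test-dataset and test-table, and assumes the BigQuery connector jar shipped with the Dataproc image is on the classpath):

# words.py -- a minimal sketch of the same BigQuery read, meant to be run on the
# master node with `spark-submit words.py`. Dataset and table IDs are the
# placeholders from the question; adjust them to your own values.
from datetime import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-connector-test").getOrCreate()

bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
project = spark._jsc.hadoopConfiguration().get("fs.gs.project.id")
input_directory = "gs://{}/tmp/bigquery-{}".format(
    bucket, datetime.strftime(datetime.today(), "%Y-%m-%d-%H-%M-%S"))

conf = {
    "mapred.bq.project.id": project,
    "mapred.bq.gcs.bucket": bucket,
    "mapred.bq.temp.gcs.path": input_directory,
    "mapred.bq.input.project.id": project,
    "mapred.bq.input.dataset.id": "my-test-dataset",
    "mapred.bq.input.table.id": "test-table"
}

table_data = spark.sparkContext.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

print(table_data.take(3))

Running it with spark-submit words.py on the master node exercises the same connector code path with Datalab taken out of the picture.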

Once you have done that, check whether you get the same error message. If you do, the issue is most likely on the Dataproc / BigQuery side; if you do not, it is most likely in Datalab instead. My guess is that you will get the same error message, as it looks like a credentials issue.

Once you have (possibly) identified where the issue lies, check which service account you are using by running the following command in a terminal after SSHing into the master node of your cluster:

gcloud auth list

Also make sure that the environment variable GOOGLE_APPLICATION_CREDENTIALS is empty by running the command below. If it is empty, the VM instance on which the node runs will use the default GCE service account (which should be the one you saw when running gcloud auth list, since Dataproc runs on GCE instances). If it is not empty, the credentials file that the variable points to will be used instead. Whether to use the default credentials or custom ones is an implementation choice.

echo $GOOGLE_APPLICATION_CREDENTIALS
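
If you prefer to do the same check from Python, a short sketch along these lines (assuming the google-auth package is installed on the node, which is not guaranteed on every image) shows which application default credentials and project the client libraries would pick up:

# Sketch: inspect the application default credentials visible to Python clients.
import google.auth

credentials, project = google.auth.default()
print("Default project:", project)
print("Credential type:", type(credentials).__name__)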

Once you know which service account is being used, go to the IAM tab in the Console and check whether that service account has the roles and permissions required to access BigQuery.

My guess is that the issue is related to the service account in use, and that GOOGLE_APPLICATION_CREDENTIALS may be pointing to the wrong location, so you should start by making sure that your authentication configuration is correct. To do that, I would run the code directly on the master node, in order to simplify the use case and reduce the number of components involved.
