Rate limit with Apache Spark GCS connector


Problem description

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:

java.io.IOException: Error inserting: bucket: *****, object: *****
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
  "code" : 429,
  "errors" : [ {
    "domain" : "usageLimits",
    "message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
  at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:472)
  ... 3 more

  • Does anyone know of a solution for this?

  • Is there a way to control the read/write rate of Spark?

  • Is there a way to increase the rate limit for my Google Project?

  • Is there a way to use a local hard disk for temp files that don't have to be shared with other slaves?

Thanks!

Solution

Unfortunately, using GCS as the DEFAULT_FS can produce high rates of directory-object creation, whether it holds only intermediate directories or also the final input/output directories. Especially when GCS is the final output directory, it's difficult to apply any Spark-side workaround to reduce the rate of redundant directory-creation requests.

The good news is that most of these directory requests are indeed redundant: the system relies on being able to do the equivalent of "mkdir -p" and cheaply get back true if the directory already exists. In our case, it's possible to fix this on the GCS-connector side by catching these errors and then simply checking whether the directory was in fact created by some other worker in a race condition.
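To illustrate the idea, here is a minimal sketch; it is not the connector's actual code, and the Gcs interface with its mkdirRaw and exists methods is a hypothetical stand-in for the connector's internal "insert directory object" and "object exists" calls. The point is simply that a 429 on directory creation can be treated as success if the directory object already exists:

import java.io.IOException;
import com.google.api.client.googleapis.json.GoogleJsonResponseException;

public class RaceTolerantMkdir {

  // Hypothetical stand-ins for the connector's internal calls:
  // mkdirRaw inserts an empty "directory" object, exists checks for it.
  interface Gcs {
    void mkdirRaw(String bucket, String dirObject) throws IOException;
    boolean exists(String bucket, String dirObject) throws IOException;
  }

  static boolean mkdirs(Gcs gcs, String bucket, String dirObject) throws IOException {
    try {
      gcs.mkdirRaw(bucket, dirObject);
      return true;
    } catch (IOException e) {
      // A 429 here usually means another worker is creating (or just created)
      // the same directory object; if it now exists, "mkdir -p" semantics
      // are already satisfied and the error is redundant.
      if (isRateLimitError(e) && gcs.exists(bucket, dirObject)) {
        return true;
      }
      throw e;
    }
  }

  private static boolean isRateLimitError(Throwable t) {
    for (; t != null; t = t.getCause()) {
      if (t instanceof GoogleJsonResponseException
          && ((GoogleJsonResponseException) t).getStatusCode() == 429) {
        return true;
      }
    }
    return false;
  }
}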

This should be fixed now with https://github.com/GoogleCloudPlatform/bigdata-interop/commit/141b1efab9ef23b6b5f5910d8206fcbc228d2ed7

To test, just run:

git clone https://github.com/GoogleCloudPlatform/bigdata-interop.git
cd bigdata-interop
mvn -P hadoop1 package
# Or for Hadoop 2
mvn -P hadoop2 package

And you should find the files "gcs/target/gcs-connector-*-shaded.jar" available for use. To plug it into bdutil, simply gsutil cp gcs/target/gcs-connector-*-shaded.jar gs://<your-bucket>/some-path/, then edit bdutil/bdutil_env.sh (for Hadoop 1) or bdutil/hadoop2_env.sh (for Hadoop 2) to change:

GCS_CONNECTOR_JAR='https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.4.1-hadoop2.jar'

so that it instead points at your gs://<your-bucket>/some-path/ path; bdutil automatically detects that you're using a gs://-prefixed URI and will do the right thing during deployment.
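For example, after the edit the line might look something like this (the jar file name is illustrative and depends on the version you built; the bucket path is the same placeholder as above):

GCS_CONNECTOR_JAR='gs://<your-bucket>/some-path/gcs-connector-<version>-shaded.jar'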

Please let us know if it fixes the issue for you!
