Rate limit with Apache Spark GCS connector

This article describes how to deal with rate limiting when using the Apache Spark GCS connector, and should be a useful reference for anyone running into the same problem.

Problem description


I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:

java.io.IOException: Error inserting: bucket: *****, object: *****
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
  "code" : 429,
  "errors" : [ {
    "domain" : "usageLimits",
    "message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "reason" : "rateLimitExceeded"
  } ],
  "message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
  at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
  at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:472)
  ... 3 more

  • Does anyone know a solution for that?

  • Is there a way to control the read/write rate of Spark?

  • Is there a way to increase the rate limit for my Google Project?

  • Is there a way to use local Hard-Disk for temp files that don't have to be shared with other slaves?

Thanks!

Solution

Unfortunately, when GCS is set as the DEFAULT_FS, its usage can produce high rates of directory-object creation, whether it is used only for intermediate directories or for final input/output directories. Especially when GCS is the final output directory, it's difficult to apply any Spark-side workaround to reduce the rate of redundant directory-creation requests.

The good news is that most of these directory requests are indeed redundant, simply because the system expects to be able to do the equivalent of "mkdir -p" cheaply, returning true if the directory already exists. In our case, it's possible to fix this on the GCS-connector side by catching these errors and then checking whether the directory was in fact created by some other worker in a race condition.
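
The actual fix lives in the connector itself (see the commit linked below), but as a rough illustration of that catch-and-recheck idea, the sketch below shows the pattern in isolation. It is not the real GCS-connector code; the DirectoryStore interface and its methods are hypothetical stand-ins for the connector's internal storage calls.

import java.io.IOException;

public class MkdirRaceWorkaroundSketch {

  // Hypothetical stand-in for the connector's low-level GCS calls.
  interface DirectoryStore {
    void insertDirectoryObject(String path) throws IOException; // create an empty "directory" object
    boolean directoryExists(String path) throws IOException;    // check whether the directory object exists
  }

  // "mkdir -p"-style semantics: return true as long as the directory exists afterwards.
  static boolean mkdirs(DirectoryStore store, String path) throws IOException {
    try {
      store.insertDirectoryObject(path);
      return true;
    } catch (IOException e) {
      // A rate-limited insert is almost certainly redundant: another worker
      // probably raced us to create the same directory. Confirm that before
      // treating the error as fatal.
      if (isRateLimitError(e) && store.directoryExists(path)) {
        return true;
      }
      throw e;
    }
  }

  // Crude check for the 429 rateLimitExceeded condition shown in the stack trace above.
  static boolean isRateLimitError(IOException e) {
    String message = String.valueOf(e.getMessage());
    return message.contains("429") || message.contains("rateLimitExceeded");
  }
}

Since the real change is inside the connector, deploying the patched jar as described below should be all that is needed on your side.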

This should be fixed now with https://github.com/GoogleCloudPlatform/bigdata-interop/commit/141b1efab9ef23b6b5f5910d8206fcbc228d2ed7

To test, just run:

git clone https://github.com/GoogleCloudPlatform/bigdata-interop.git
cd bigdata-interop
mvn -P hadoop1 package
# Or, for Hadoop 2
mvn -P hadoop2 package

And you should find the file gcs/target/gcs-connector-*-shaded.jar available for use. To plug it into bdutil, simply gsutil cp gcs/target/gcs-connector-*-shaded.jar gs://<your-bucket>/some-path/ and then edit bdutil/bdutil_env.sh (for Hadoop 1) or bdutil/hadoop2_env.sh to change:

GCS_CONNECTOR_JAR='https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.4.1-hadoop2.jar'

to instead point at your gs://<your-bucket>/some-path/ path; bdutil automatically detects that you're using a gs:// prefixed URI and will do the right thing during deployment.
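
For example, assuming the build produced a jar matching the gcs-connector-*-shaded.jar pattern above, the edited line might look something like the following, where <your-bucket>, some-path, and <version> are placeholders for your own bucket, upload path, and the version string of the jar you built (not values from the original post):

GCS_CONNECTOR_JAR='gs://<your-bucket>/some-path/gcs-connector-<version>-shaded.jar'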

Please let us know if it fixes the issue for you!

This concludes this article on rate limiting with the Apache Spark GCS connector. We hope the answer above is helpful, and thank you for supporting IT屋!
