Adding GKE node pool with GPU using terraform


Question


I am trying to create a google_container_node_pool with GPUs. I tried the machine types nvidia-tesla-p4 and a2-highgpu-1g, and each returns a different error:

projects/my-project-id/zones/us-central1-a/machineTypes/nvidia-tesla-p4

or

Error: error creating NodePool: googleapi: Error 403: Insufficient regional quota to satisfy request: resource "PREEMPTIBLE_NVIDIA_V100_GPUS": request requires '3.0' and is short '2.0'. project has a quota of '1.0' with '1.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=my-project-id., forbidden

When I check the quotas page, the relevant quota shows "All 99 quotas are within limit".

According to the error I need more quota, but it is not clear which quota I should request.

Update:

Changing machine_type to a2-highgpu-1g changed the error message to refer to a different quota, A2_CPUS. When I set preemptible to false, I get the same error for NVIDIA_A100_GPUS instead of PREEMPTIBLE_NVIDIA_V100_GPUS or A2_CPUS. The problem with both A2_CPUS and NVIDIA_A100_GPUS is that I can't request more quota: the checkbox in the UI is disabled and the limit is shown as "Unlimited".

Solution

The first message appears because GCP has no machine type named nvidia-tesla-p4; that name is a GPU accelerator type, not a machine type. The machine types documentation has a comprehensive list of the available machine types, but make sure to use one that is available in the region and zone where you're spinning up your GKE cluster. You can check the machine types available in a zone with this command: gcloud compute machine-types list --filter="zone:( ZONE … )"
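To illustrate the difference between a machine type and a GPU model, here is a minimal Terraform sketch. It assumes the google provider's guest_accelerator block inside node_config. The resource name, pool name, cluster name, and the n1-standard-4 machine type are hypothetical placeholders that do not come from the question, and you should still confirm that the chosen GPU is offered in your zone.

```hcl
# Sketch only: names and sizes below are illustrative assumptions.
resource "google_container_node_pool" "gpu_pool" {
  name       = "gpu-pool"      # hypothetical pool name
  cluster    = "my-cluster"    # hypothetical name of an existing GKE cluster
  location   = "us-central1-a" # zone that appears in the error message
  node_count = 1

  node_config {
    # A regular Compute Engine machine type goes here, not the GPU model.
    machine_type = "n1-standard-4"
    preemptible  = true

    # The GPU model is declared as an accelerator attached to each node.
    guest_accelerator {
      type  = "nvidia-tesla-p4"
      count = 1
    }
  }
}
```

With preemptible set to true, the request counts against the preemptible GPU quota for that model, so the quota checks described in the rest of this answer still apply.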

The second message means you don't have enough quota for that specific GPU in that region: the request needs 3 preemptible V100 GPUs, while the project's regional quota is 1. As @hilsenrat mentioned, the quotas page shows nothing as exhausted because the node pool was never created, so no quota was consumed.

As mentioned in the Availability section of the documentation on running GPUs in GKE:

GPUs are available in specific regions and zones. When you request GPU quota, consider the regions in which you intend to run your clusters.

For a complete list of applicable regions and zones, refer to GPUs on Compute Engine.

To see a list of all GPU accelerator types supported in each zone, run the following command: gcloud compute accelerator-types list --filter="zone:( ZONE )"

Because adding a GPU to a preemptible instance uses your regular GPU quota, I would also make sure that the V100 quota in the region is sufficient. If you need a separate quota for preemptible GPUs, request a separate Preemptible GPU quota as described in the Compute Engine documentation.

I suggest going to the quota page and filtering these specific quotas, making sure you click on "ALL QUOTAS" under the Details column. Regional quotas will be displayed.

  • Service: Compute Engine API
      • Name: GPUs (all regions)
      • Quota Metric: compute.googleapis.com/gpus_all_regions
      • Limit Name: GPUS-ALL-REGIONS-per-project

  • Service: Compute Engine API
      • Name: NVIDIA V100 GPUs
      • Quota Metric: compute.googleapis.com/nvidia_v100_gpus
      • Limit Name: NVIDIA-V100-GPUS-per-project-zone/NVIDIA-V100-GPUS-per-project-region

  • Service: Compute Engine API
      • Name: Preemptible NVIDIA V100 GPUs
      • Quota Metric: compute.googleapis.com/preemptible_nvidia_v100_gpus
      • Limit Name: PREEMPTIBLE-NVIDIA-V100-GPUS-per-project-zone/PREEMPTIBLE-NVIDIA-V100-GPUS-per-project-region

Make sure you have enough global and regional quota for the specific GPU model you are trying to use. Quota for preemptible GPUs needs to be requested separately.

Update:

Also, please note that an increase can only be requested for regional quotas. Any zonal quota listed depends on the corresponding regional quota. Even if the zonal limits read "Unlimited", a regional quota of 0 means that any attempt to use GPUs anywhere in that region will fail. On the quotas page, only the regional quota can be selected for editing.

You mention that you now get a message saying you don't have enough quota for A2 CPUs. Make sure you have enough CPU quota in the region and enough A2 CPU quota as well. For this, consider the number of vCPUs required by the machine type you want to deploy.
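As a rough way to estimate the quota a node pool will request, the arithmetic can be sketched with Terraform locals. The per-node figures are those of a2-highgpu-1g (12 vCPUs and 1 A100 GPU). The node count and zone count are assumptions for illustration: when a node pool spans several zones, node_count applies per zone, which might explain a request for 3 GPUs when only 1 node was configured.

```hcl
# Sketch only: node_count and zone_count are illustrative assumptions.
locals {
  vcpus_per_node = 12 # a2-highgpu-1g provides 12 vCPUs
  gpus_per_node  = 1  # a2-highgpu-1g comes with 1 A100 GPU
  node_count     = 1  # node_count of the node pool (applies per zone)
  zone_count     = 3  # assumed number of zones the node pool spans

  # Totals the request would need from the regional quotas.
  required_a2_cpus   = local.vcpus_per_node * local.node_count * local.zone_count
  required_a100_gpus = local.gpus_per_node * local.node_count * local.zone_count
}

output "required_a2_cpus" {
  value = local.required_a2_cpus # 36 with the values above
}

output "required_a100_gpus" {
  value = local.required_a100_gpus # 3 with the values above
}
```

Compare these totals with the regional limits for A2_CPUS and NVIDIA_A100_GPUS (or their preemptible counterparts) before applying the configuration.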

You can read more about working with CPU quotas in the Compute Engine documentation.

I hope this information is useful and clarifies your question.
