Dask keeps failing with killed worker exception while running TPOT


Problem description

I'm running TPOT with Dask on a Kubernetes cluster on GCP. The cluster has 24 cores and 120 GB of memory across 4 Kubernetes nodes. My Kubernetes YAML is:

apiVersion: v1
kind: Service
metadata:
  name: daskd-scheduler
  labels:
    app: daskd
    role: scheduler
spec:
  ports:
  - port: 8786
    targetPort: 8786
    name: scheduler
  - port: 8787
    targetPort: 8787
    name: bokeh
  - port: 9786
    targetPort: 9786
    name: http
  - port: 8888
    targetPort: 8888
    name: jupyter
  selector:
    app: daskd
    role: scheduler
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-scheduler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: daskd
        role: scheduler
    spec:
      containers:
      - name: scheduler
        image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: ["/opt/conda/bin/dask-scheduler"]
        resources:
          requests:
            cpu: 1
            memory: 20000Mi # set aside some extra resources for the scheduler
        ports:
        - containerPort: 8786
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-worker
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: daskd
        role: worker
    spec:
      containers:
      - name: worker
        image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: [
          "/bin/bash",
          "-cx",
          "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 8 --nprocs 1 --memory-limit 5e9",
        ]
        resources:
          requests:
            cpu: 2
            memory: 20000Mi

My data is 4 million rows and 77 columns. Whenever I run fit on the TPOT classifier, it runs on the Dask cluster for a while and then crashes. The output log looks like:

KilledWorker:
("('gradientboostingclassifier-fit-1c9d29ce92072868462946c12335e5dd',
0, 4)", 'tcp://10.8.1.14:35499')

I tried increasing the threads per worker as suggested by the Dask distributed docs, yet the problem persists. Some observations I have made are:


  • It takes longer to crash when n_jobs is smaller (with n_jobs=4 it ran for 20 minutes before crashing), whereas it crashes almost instantly with n_jobs=-1.

  • It actually starts working and produces an optimized model with less data; with 10,000 rows it works fine.

So my question is: what changes and modifications do I need to make to get this working? I guess it's doable, as I've heard Dask is capable of handling even bigger data than mine.

Recommended answer

Best practices described on Dask's official documentation page say:


Your Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. Otherwise your workers may get killed by Kubernetes as they pack into the same node and overwhelm that node's available memory, leading to KilledWorker errors.

In your case, these configuration parameters' values don't match from what I can see:

Kubernetes container limit: 20 GB vs. dask-worker command limit: 5 GB
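
For illustration, one way to bring the two into line is to make the dask-worker flags mirror the container's resources. The values below are a sketch based on the requests already present in the worker manifest (2 CPUs, 20 GB per worker), not a prescription:

# Worker container (sketch): dask-worker flags aligned with the Kubernetes resources
command: [
  "/bin/bash",
  "-cx",
  "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 2 --nprocs 1 --memory-limit 20GB",
]
resources:
  requests:
    cpu: 2
    memory: 20000Mi
  limits:
    cpu: 2
    memory: 20000Mi   # matches --memory-limit so Kubernetes and Dask agree on the budget

Equivalently, --memory-limit could be left at 5e9 and the container request/limit lowered to roughly 5 GB; the key point is that the two values agree.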
