Dask keeps failing with KilledWorker exception while running TPOT
Problem description
I'm running TPOT with Dask on a Kubernetes cluster on GCP. The cluster has 24 cores and 120 GB of memory across 4 Kubernetes nodes. My Kubernetes YAML is:
apiVersion: v1
kind: Service
metadata:
  name: daskd-scheduler
  labels:
    app: daskd
    role: scheduler
spec:
  ports:
  - port: 8786
    targetPort: 8786
    name: scheduler
  - port: 8787
    targetPort: 8787
    name: bokeh
  - port: 9786
    targetPort: 9786
    name: http
  - port: 8888
    targetPort: 8888
    name: jupyter
  selector:
    app: daskd
    role: scheduler
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-scheduler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: daskd
        role: scheduler
    spec:
      containers:
      - name: scheduler
        image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: ["/opt/conda/bin/dask-scheduler"]
        resources:
          requests:
            cpu: 1
            memory: 20000Mi  # set aside some extra resources for the scheduler
        ports:
        - containerPort: 8786
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-worker
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: daskd
        role: worker
    spec:
      containers:
      - name: worker
        image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: [
          "/bin/bash",
          "-cx",
          "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 8 --nprocs 1 --memory-limit 5e9",
        ]
        resources:
          requests:
            cpu: 2
            memory: 20000Mi
My data is 4 million rows and 77 columns. Whenever I call fit on the TPOT classifier, it runs on the Dask cluster for a while and then crashes. The output log looks like:
KilledWorker: ("('gradientboostingclassifier-fit-1c9d29ce92072868462946c12335e5dd', 0, 4)", 'tcp://10.8.1.14:35499')
I tried increasing the number of threads per worker as suggested by the Dask distributed docs, yet the problem persists. Some observations I have made are:
- It takes longer to crash when n_jobs is lower (with n_jobs=4 it ran for 20 minutes before crashing), whereas with n_jobs=-1 it crashes almost instantly.
- It actually starts working and produces an optimized model on smaller data; with 10,000 rows it works fine.
So my question is: what changes and modifications do I need to make to get this working? I assume it is doable, since I've heard Dask is capable of handling even bigger data than mine.
Recommended answer
The best practices described on Dask's official documentation page say:
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. Otherwise your workers may get killed by Kubernetes as they pack into the same node and overwhelm that node's available memory, leading to KilledWorker errors.
In your case these configuration parameters' values don't match, from what I can see:
Kubernetes container limit: 20 GB vs. dask-worker command limit: 5 GB
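One way to bring the two into agreement (a sketch, not a verified fix for this cluster) is to raise the dask-worker --memory-limit to match the container's memory, and to set an explicit Kubernetes limits block alongside the requests so the scheduler and the worker process agree on the budget:

```yaml
# Worker container spec with matching memory figures (sketch):
# the dask-worker --memory-limit (20e9 bytes, ~20 GB) now matches
# the 20000Mi that Kubernetes requests and limits for the container.
spec:
  containers:
  - name: worker
    image: uyogesh/daskml-tpot-gcpfs
    imagePullPolicy: Always
    command: [
      "/bin/bash",
      "-cx",
      "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 8 --nprocs 1 --memory-limit 20e9",
    ]
    resources:
      requests:
        cpu: 2
        memory: 20000Mi
      limits:
        memory: 20000Mi
```

The alternative direction also works: keep --memory-limit 5e9 and shrink the container request/limit to roughly 5 GB, which would let Kubernetes pack more workers per node without any single worker overcommitting the node's memory.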
This concludes the article on "Dask keeps failing with killed worker exception while running TPOT". We hope the recommended answer helps, and thanks for supporting IT屋!