IPython Notebook kernel dying while running KMeans


Question



I am running K-means clustering on some 400K observations with 12 variables. Initially, as soon as I run the cell with the KMeans code, it pops up a message after about 2 minutes saying the kernel was interrupted and will restart. Then it takes ages, as if the kernel has died, and the code won't run anymore.

So I tried with 125k observations and the same number of variables, but I still got the same message.

What does that mean? Does it mean IPython Notebook is not able to run KMeans on 125k observations and kills the kernel?

How can I solve this? It is pretty important that I get this done today. :(

Please advise.

Code I used:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialize the clusterer with the n_clusters value and a random generator
# seed of 10 for reproducibility.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100,
                random_state=10)
# pandas' .ix indexer has since been removed; .iloc selects every column
# but the first.
kmeans.fit(Data_sampled.iloc[:, 1:])
cluster_labels = kmeans.labels_

# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters.
silhouette_avg = silhouette_score(Data_sampled.iloc[:, 1:], cluster_labels)

Solution

From some investigation, this likely has nothing to do with IPython Notebook / Jupyter. It seems to be an issue with sklearn that traces back to an issue in numpy; see the related sklearn GitHub issues here and here, and the underlying numpy issue here.

Ultimately, calculating the silhouette score requires computing a very large distance matrix, and for large numbers of rows that matrix takes up too much memory on your system. For instance, watching memory pressure on my system (OS X, 8 GB RAM) during two runs of a similar calculation, the first spike was a silhouette score calculation with 10k records, and the second, plateau-like one was with 40k records.
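
To see the scale involved, here is a back-of-the-envelope sketch (my own illustration, not part of the original answer) of the n x n float64 distance matrix that a full silhouette computation has to materialize; the actual peak depends on the sklearn version, but it is on this order:

def distance_matrix_gb(n_samples):
    # An n x n pairwise distance matrix holds n**2 float64 entries,
    # at 8 bytes each.
    return n_samples ** 2 * 8 / 1024 ** 3

for n in (10000, 40000, 125000, 400000):
    print(f"{n:>7} rows -> ~{distance_matrix_gb(n):,.0f} GB")

#   10000 rows -> ~1 GB
#   40000 rows -> ~12 GB
#  125000 rows -> ~116 GB
#  400000 rows -> ~1,192 GB

The 40k case already exceeds the 8 GB machine above, and 125k or 400k rows is far beyond any workstation's RAM.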

Per the related SO answer here, your kernel process is probably getting killed by the OS because it is taking too much memory.
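
If you want to confirm how much memory the kernel itself is holding, a minimal sketch using the third-party psutil package (my addition, not from the original answer) can log the process's resident memory just before the expensive call:

import os

import psutil  # third-party: pip install psutil

# Resident set size of the current (kernel) process, in GB.
process = psutil.Process(os.getpid())
rss_gb = process.memory_info().rss / 1024 ** 3
print(f"Kernel resident memory before scoring: {rss_gb:.2f} GB")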

Ultimately, this is going to require some fixes in the underlying codebase for sklearn and/or numpy. Some options that you can try in the interim:

  • Close every extraneous program running on your computer (Spotify, Slack, etc.), hope that frees up enough memory, and monitor memory closely while your script is running (the psutil sketch above can help).
  • Run the calculation on a temporary remote server with more RAM than your machine and see if that helps (although, since I think memory use grows at least polynomially with the number of samples, this may not work).
  • Train your model on your full data set, but then calculate silhouette scores on a random subset of your data, as sketched below. (Most people seem to be able to get this working with 20-30k observations.)
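
As a concrete version of the last option, sklearn's silhouette_score has a built-in sample_size parameter that scores a random subset of rows instead of the full data. A minimal sketch, reusing Data_sampled from the question (random_state added for reproducibility):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = Data_sampled.iloc[:, 1:]

# Fitting KMeans on all 400k rows is fine; the memory blowup comes
# from the scoring step, not the clustering.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100,
                random_state=10)
cluster_labels = kmeans.fit_predict(X)

# sample_size=20000 limits the distance matrix to 20,000 x 20,000
# (~3 GB) instead of 400,000 x 400,000.
silhouette_avg = silhouette_score(X, cluster_labels,
                                  sample_size=20000, random_state=10)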

Or, if you're smarter than me and have some free time, consider contributing a fix to sklearn and/or numpy :)

