Does partial_fit run in parallel in sklearn.decomposition.IncrementalPCA?


Problem description


I've followed Imanol Luengo's answer to build a partial fit and transform for sklearn.decomposition.IncrementalPCA. But for some reason, it looks like (from htop) it is using all CPU cores at full load. I could find neither an n_jobs parameter nor anything related to multiprocessing. My question is: if this is the default behavior of these functions, how can I set the number of CPUs, and where can I find information about it? If not, I am obviously doing something wrong in an earlier section of my code.


PS: I need to limit the number of CPU cores, because using all cores on a shared server causes a lot of trouble for other users.


Additional information and debug code: It has been a while and I still can't figure out the reason for this behavior or how to limit the number of CPU cores used at a time, so I've decided to provide sample code to test it. Note that this code snippet is taken from sklearn's website; the only change is increasing the size of the dataset, so the behavior is easy to see.

from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
import numpy as np

X, _ = load_digits(return_X_y=True)

# Stack the dataset with itself to increase its size, so the behavior is visible in htop.
for _ in range(8):
    X = np.vstack((X, X))

print(X.shape)

transformer = IncrementalPCA(n_components=7, batch_size=200)
transformer.partial_fit(X[:100, :])  # incremental update on a single mini-batch
X_transformed = transformer.fit_transform(X)  # note: fit_transform refits from scratch on the full array

print(X_transformed.shape)

The output is:

(460032, 64)
(460032, 7)

Process finished with exit code 0

And htop shows all CPU cores running at full utilization (the original screenshot is not preserved).

Answer


I was looking for a workaround to this problem in another post of mine, and I figured out that it is not a fault of the scikit-learn implementation, but rather of the BLAS library (specifically OpenBLAS) used by numpy, which sklearn's IncrementalPCA relies on. OpenBLAS is set to use all available threads by default. Detailed information can be found here.
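As a side note not in the original answer: if numpy is already imported by the time you want to limit threads, the third-party threadpoolctl package (an assumption here; it must be installed separately) can inspect and cap the BLAS thread pools at runtime. A minimal sketch:

```python
from threadpoolctl import threadpool_limits, threadpool_info
import numpy as np

# Cap every BLAS thread pool at one thread for the duration of the block.
with threadpool_limits(limits=1, user_api="blas"):
    info = threadpool_info()   # snapshot of the active thread pools
    a = np.random.rand(200, 200)
    _ = a @ a                  # this matmul runs single-threaded

# Outside the block, the previous thread counts are restored automatically.
```

Unlike the environment-variable approach below, this works after import and is scoped to a block, which is convenient on a shared server.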


TL;DR: I solved the issue by setting the BLAS environment variables before importing numpy, or any library that imports numpy, with the code below. Detailed information can be found here.

import os
# These must be set BEFORE numpy (or anything that imports numpy) is imported,
# and os.environ values must be strings, not integers.
os.environ["OMP_NUM_THREADS"] = "1"         # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1"    # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1"         # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1"     # export NUMEXPR_NUM_THREADS=1
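The same assignments can also be written as a loop (a hypothetical equivalent form, not part of the original answer); either way, the values must be strings and must be set before the first numpy import anywhere in the process:

```python
import os

# Set all BLAS-related thread caps before numpy is imported anywhere.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"  # os.environ accepts only string values

# import numpy as np  # the import must come AFTER the assignments above,
#                     # because OpenBLAS reads these variables at load time
```

Setting the variables inside the script only works if it runs before the import; exporting them in the shell before launching Python achieves the same effect.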

