Why is scikit-learn neighbors slower with n_jobs > 1 and forkserver?
Question
I'm using scikit-learn for Metaheuristics exercises and I have a question. I need to use knn, so I have a KNearestNeighbors object with n_jobs=-1. As the docs say, I have to set the multiprocessing start method to forkserver. But knn is much slower with n_jobs=-1 than with n_jobs=1.
Here is a piece of the code:
def main():
    ### Some initialization here ###
    skf = StratifiedKFold(target, n_folds=2, shuffle=True)
    for train_index, test_index in skf:
        data_train, data_test = data[train_index], data[test_index]
        target_train, target_test = target[train_index], target[test_index]
        # Time one full feature-selection run on this fold
        start = time()
        selected_features, score = SFS(data_train, data_test, target_train, target_test, knn)
        end = time()
        logger.info("SFS - Time elapsed: " + str(end-start) + ". Score: " + str(score) + ". Selected features: " + str(sum(selected_features)))

if __name__ == "__main__":
    import multiprocessing as mp
    mp.set_start_method('forkserver', force=True)
    main()
And this is the SFS function:
def SFS(data_train, data_test, target_train, target_test, classifier):
    rowsize = len(data_train[0])
    # Boolean mask of the features selected so far (np.bool is removed in recent NumPy)
    selected_features = np.zeros(rowsize, dtype=bool)
    best_score = 0
    best_feature = 0
    # Greedily add one feature per pass until no candidate improves the score
    while best_feature is not None:
        best_feature = None
        for idx in range(rowsize):
            if selected_features[idx]:
                continue
            # Tentatively add feature idx, then fit and score the classifier
            selected_features[idx] = True
            classifier.fit(data_train[:, selected_features], target_train)
            score = classifier.score(data_test[:, selected_features], target_test)
            selected_features[idx] = False
            if score > best_score:
                best_score = score
                best_feature = idx
        if best_feature is not None:
            selected_features[best_feature] = True
    return selected_features, best_score
I don't understand how n_jobs > 1 can be slower than n_jobs = 1. Can anyone explain this? I've tried it with 3 datasets.
Recommended answer
I found that many people have run into the same problem: n_jobs does not seem to work in sklearn's KNearestNeighbors. They also complained that only one CPU core was loaded.
In my experiments, the fitting process uses only a single core whether or not n_jobs > 1. So no matter how large you set n_jobs, if your training set is large, training will still take a long time and will not get faster.
And the reason n_jobs > 1 is even slower than n_jobs = 1 is the cost of setting up and distributing resources for multiprocessing.
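One way to check this on your own machine is to time fit and score separately for both settings. The sketch below is only an illustration, not the asker's setup: it assumes the classifier is sklearn's KNeighborsClassifier, uses synthetic data from make_classification, and picks arbitrary small sizes to mimic a single SFS step. The helper name time_knn is made up for this example. The sklearn documentation describes n_jobs as the number of parallel jobs for the neighbors search, so any difference should show up in the score column rather than the fit column.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def time_knn(n_jobs, X_train, y_train, X_test, y_test, repeats=20):
    """Return (fit_seconds, score_seconds, accuracy), with times summed over `repeats` rounds."""
    knn = KNeighborsClassifier(n_neighbors=5, n_jobs=n_jobs)
    fit_time = 0.0
    score_time = 0.0
    accuracy = 0.0
    for _ in range(repeats):
        t0 = perf_counter()
        knn.fit(X_train, y_train)
        t1 = perf_counter()
        accuracy = knn.score(X_test, y_test)
        t2 = perf_counter()
        fit_time += t1 - t0
        score_time += t2 - t1
    return fit_time, score_time, accuracy


# Synthetic data (an assumption for this sketch); sizes are arbitrary and small
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

if __name__ == "__main__":
    for n_jobs in (1, -1):
        fit_time, score_time, accuracy = time_knn(n_jobs, X_train, y_train, X_test, y_test)
        print(f"n_jobs={n_jobs:>2}: fit {fit_time:.4f}s, "
              f"score {score_time:.4f}s, accuracy {accuracy:.3f}")
```

The exact timings depend on your hardware, library versions and dataset size, so no expected output is given. In a loop like SFS, which calls fit and score many times on small feature subsets, any fixed per-call parallel overhead is paid on every iteration, which is consistent with the slowdown described above.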