Parallel jobs don't finish in scikit-learn's GridSearchCV

Question

In the following script, I'm finding that the jobs launched by GridSearchCV seem to hang.

import json
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import sklearn.cross_validation as CV
from sklearn.grid_search import GridSearchCV
from nltk.stem import WordNetLemmatizer

# Seed for randomization. Set to some definite integer for debugging and set to None for production
seed = None


### Text processing functions ###

def normalize(string):#Remove diacritics and whatevs
    return "".join(ch.lower() for ch in unicodedata.normalize('NFD', string) if not unicodedata.combining(ch))

wnl = WordNetLemmatizer()
def tokenize(string):#Ignores special characters and punct
    return [wnl.lemmatize(token) for token in re.compile(r'\w\w+').findall(string)]

def ngrammer(tokens):#Gets all grams in each ingredient
    max_n = 2
    return [":".join(tokens[idx:idx+n]) for n in np.arange(1,1 + min(max_n,len(tokens))) for idx in range(len(tokens) + 1 - n)]

print("Importing training data...")
with open('/Users/josh/dev/kaggle/whats-cooking/data/train.json','rt') as file:
    recipes_train_json = json.load(file)

# Build the grams for the training data
print('\nBuilding n-grams from input data...')
for recipe in recipes_train_json:
    recipe['grams'] = [term for ingredient in recipe['ingredients'] for term in ngrammer(tokenize(normalize(ingredient)))]

# Build vocabulary from training data grams. 
vocabulary = list({gram for recipe in recipes_train_json for gram in recipe['grams']})

# Stuff everything into a dataframe. 
ids_index = pd.Index([recipe['id'] for recipe in recipes_train_json],name='id')
recipes_train = pd.DataFrame([{'cuisine': recipe['cuisine'], 'ingredients': " ".join(recipe['grams'])} for recipe in recipes_train_json],columns=['cuisine','ingredients'], index=ids_index)


# Extract data for fitting
fit_data = recipes_train['ingredients'].values
fit_target = recipes_train['cuisine'].values

# extracting numerical features from the ingredient text
feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                        ('tfidf', TfidfTransformer(use_idf=True)),
                        ('svd', TruncatedSVD(n_components=1000))
])
lsa_fit_data = feature_ext.fit_transform(fit_data)

# Build SGD Classifier
clf = SGDClassifier(random_state=seed)
# Hyperparameter grid for GridSearchCV.
parameters = {
    'alpha': np.logspace(-6,-2,5),
}

# Init GridSearchCV with k-fold CV object
cv = CV.KFold(lsa_fit_data.shape[0], n_folds=3, shuffle=True, random_state=seed)
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy',
    verbose=2    
)
# Fit on training data
print("\nPerforming grid search over hyperparameters...")
gs_clf.fit(lsa_fit_data, fit_target)

The console output is:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=0.0001 ....................................................
[CV] alpha=0.0001 .................................................... 

And then it just hangs. If I set n_jobs=1 in GridSearchCV, then the script completes as expected with output:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.5s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   7.0s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.8s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.6s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   6.7s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.3s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.1s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.7min finished

The single-threaded execution finishes pretty quickly so I'm sure I'm giving the parallel job case enough time to do the calculation itself.

Environment specs: MacBook Pro (15-inch, Mid 2010), 2.4 GHz Intel Core i5, 8 GB 1067 MHz DDR3, OSX 10.10.5, python 3.4.3, ipython 3.2.0, numpy v1.9.3, scipy 0.16.0, scikit-learn v0.16.1 (python and packages all from anaconda distro)

Some other comments:

I use n_jobs=-1 with GridSearchCV all the time on this machine without issue, so my platform does support the functionality. It usually has 4 jobs out at a time, as I've got 4 cores on this machine (2 physical, but 4 "virtual" cores due to Mac hyperthreading). But unless I misunderstand the console output, in this case it has 8 jobs out without any returning. Watching CPU usage in Activity Monitor in real time: 4 jobs launch, work a bit, then finish (or die?), followed by 4 more that launch, work a bit, and then go completely idle but stick around.
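
A quick sanity check of the worker count that n_jobs=-1 maps to (a sketch; joblib sizes its pool from the CPU count reported here):

import multiprocessing
print(multiprocessing.cpu_count())  # reports 4 on this machine (2 physical cores, hyperthreaded)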

At no point do I see significant memory pressure. The main process tops out at about 1 GB of real memory, the child processes at around 600 MB. By the time they hang, real memory usage is negligible.

The script works fine with multiple jobs if one removes the TruncatedSVD step from the feature extraction pipeline. Note, though, that this pipeline runs before the grid search and is not part of the GridSearchCV job(s).
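
If one instead wanted the SVD step inside the searched estimator (so each fold refits it), a minimal sketch reusing the names defined in the script above would be:

full_pipe = Pipeline([('features', feature_ext), ('clf', clf)])
gs_full = GridSearchCV(full_pipe, param_grid={'clf__alpha': np.logspace(-6, -2, 5)}, cv=3, n_jobs=1)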

This script is for the Kaggle competition What's Cooking?, so if you want to try running it on the same data I'm using, you can grab it from there. The data comes as a JSON array of objects. Each object represents a recipe and contains a list of text snippets, which are the ingredients. Since each sample is a collection of documents rather than a single document, I ended up having to write some of my own n-gramming and tokenization logic, since I couldn't figure out how to get scikit-learn's built-in transformers to do exactly what I wanted. I doubt any of that matters, but just an FYI.
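
A single record in train.json looks roughly like this (an illustrative entry, not copied from the actual file):

{"id": 10259, "cuisine": "greek", "ingredients": ["romaine lettuce", "black olives", "feta cheese crumbles"]}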

I usually run scripts within the IPython CLI with %run, but I get the same behavior running the script directly with python (3.4.3) from the OS X bash terminal.

Answer

如果njob> 1,这可能是GridSearchCV使用的多处理问题.因此,您可以尝试使用多线程来查看其是否工作正常,而不是使用多处理.

This might be an issue with multiprocessing used by GridSearchCV if njob>1. So rather than using multiprocessing, you can try multithreading to see if it works fine.

from sklearn.externals.joblib import parallel_backend
# Note: on recent scikit-learn versions sklearn.externals.joblib is gone;
# use `from joblib import parallel_backend` instead.

clf = GridSearchCV(...)
with parallel_backend('threading'):
    clf.fit(x_train, y_train)
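
Applied to the script in the question, that would be (a sketch; gs_clf, lsa_fit_data, and fit_target as defined there):

with parallel_backend('threading'):
    gs_clf.fit(lsa_fit_data, fit_target)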

使用njob> 1的GSV进行估算时,我遇到了同样的问题,并且在njob值之间使用时效果很好.

I was having the same issue with my estimator using GSV with njob >1 and using this works great across njob values.

PS: I am not sure if "threading" has the same advantages as "multiprocessing" for all estimators. But theoretically, "threading" would not be a great choice if your estimator is limited by the GIL; if the estimator is Cython/numpy-based, it would be better than "multiprocessing".
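
If you want to check which backend actually helps for a given estimator, a small timing harness is one option (purely illustrative; it reuses gs_clf and the data from the question, and note that on the affected machine the multiprocessing run may simply reproduce the hang, which is itself a useful diagnostic):

import time
from sklearn.externals.joblib import parallel_backend

for backend in ('threading', 'multiprocessing'):
    start = time.time()
    with parallel_backend(backend):
        gs_clf.fit(lsa_fit_data, fit_target)
    print(backend, 'took %.1fs' % (time.time() - start))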

System tested on:

macOS: 10.12.6
Python: 3.6
numpy==1.13.3
pandas==0.21.0
scikit-learn==0.19.1
