scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

Problem description

My code runs fine with smaller test samples, e.g. 10000 rows of data in X_train and y_train. When I call it for millions of rows, I get the error below. Is this a bug in a package, or is there something I can do differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put the pool.py from Anaconda's multiprocessing package and the parallel.py from scikit-learn's externals package on my Dropbox for you.

The test script is:

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")

    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    # 1,000,000 rows x 1,000 float64 features ~= 8 GB of input data
    n_samples = 1000000
    n_features = 1000

    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)

    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha' : 10.0 ** -np.arange(1,7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    # 8 worker processes: each fit runs in a subprocess that must
    # receive the training data (as a pickled copy or a memmap)
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)

if __name__=='__main__':
    mp.freeze_support()
    main()

This results in the following output:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written

Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
[... the three MKL trial-mode lines above repeat eight times, once per worker process ...]
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code
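
For context, the 'i' format code in the final error is struct's 32-bit signed integer: multiprocessing length-prefixes each pickled message with it, so any single payload over 2**31 - 1 bytes overflows the header. A minimal sketch of that limit (not part of the original script):

import struct

# 'i' packs a 32-bit signed int: values above 2**31 - 1 cannot be encoded.
struct.pack('i', 2**31 - 1)  # works
struct.pack('i', 2**31)      # raises struct.error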

ogrisel's answer does work with manual memory mapping under scikit-learn 0.15.0b1. Don't forget to run only one script at a time, otherwise you can still run out of memory by having too many threads. (My run takes ~60 GB on data that is ~12.5 GB as CSV, with 8 threads.)

Recommended answer

As a workaround you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

Edit #1: this is the important part:

from sklearn.externals import joblib

# Dump the array once, then reload it as a memory-mapped array that the
# worker processes can share instead of pickling the full 8 GB payload.
joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmap'ed data to GridSearchCV under scikit-learn 0.15+.
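
A compact sketch of the whole workaround, reusing the X_train, y_train, and gs objects from the test script above (the file name 'X_train.pkl' is illustrative, not from the original answer):

from sklearn.externals import joblib

joblib.dump(X_train, 'X_train.pkl')                   # write the array to disk once
X_train = joblib.load('X_train.pkl', mmap_mode='r+')  # reload it as a writable memmap

# GridSearchCV workers now share the on-disk buffer instead of each
# receiving an 8 GB pickled copy of X_train.
gs.fit(X_train, y_train)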

Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process will be limited to 2 GB of address space, which can also limit the available memory.
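
A quick way to check which build you are running (sys and struct are standard library; a 32-bit interpreter reports a 4-byte pointer):

import struct
import sys

print(sys.version)
# 4-byte pointers mean a 32-bit build (2 GB address space per process);
# 8-byte pointers mean a 64-bit build.
print("%d-bit Python" % (struct.calcsize("P") * 8))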

I just found a bug in numpy.save under Python 3.4, but even once that is fixed, the subsequent call to mmap will fail with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1 at this time).

Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently joblib.Parallel memory-maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file, and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
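
GridSearchCV itself does not expose this setting, but code that drives joblib.Parallel directly can override the mode per call; a minimal sketch, assuming a joblib version whose Parallel constructor accepts max_nbytes and mmap_mode (as the 0.8-era releases bundled with scikit-learn 0.15 do):

import numpy as np
from sklearn.externals import joblib

data = np.random.randn(10000, 100)  # ~8 MB of float64 input

# mmap_mode='r' maps the shared input read-only instead of the default
# copy-on-write 'c', so Windows does not reserve paging-file space per
# worker; max_nbytes sets the size above which inputs get memmapped.
results = joblib.Parallel(n_jobs=2, max_nbytes=int(1e6), mmap_mode='r')(
    joblib.delayed(np.mean)(data[i::4]) for i in range(4))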
