scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays


Problem description

My code runs fine with smaller test samples, such as 10000 rows of data in X_train, y_train. When I call it for millions of rows, I get the error shown below. Is the bug in a package, or can I do something differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's externals package on my Dropbox for you.

The test script is:

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")

    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000

    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)

    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha' : 10.0 ** -np.arange(1,7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)

if __name__=='__main__':
    mp.freeze_support()
    main()

This results in the output:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written

Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code
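
The error message refers to the 'i' format code of Python's struct module, a signed C int that is 4 bytes wide on common platforms. The sketch below only illustrates that range and compares it with the byte size of the array from the test script. The traceback does not show where the packing happens, so reading the failure as a size of this order being packed into such a field is an assumption. The helper name fits_in_struct_i is hypothetical.

```python
import struct


def fits_in_struct_i(value):
    """Return True if `value` can be packed with struct's 'i' format code."""
    try:
        struct.pack("i", value)
        return True
    except struct.error:
        return False


# Largest value a 4-byte signed int can hold.
INT_MAX = 2**31 - 1

# Byte size of the float64 array built in the test script:
# 1000000 rows * 1000 columns * 8 bytes per value.
array_nbytes = 1000000 * 1000 * 8

print("size of 'i' in bytes:", struct.calcsize("i"))
print("INT_MAX fits:", fits_in_struct_i(INT_MAX))
print("INT_MAX + 1 fits:", fits_in_struct_i(INT_MAX + 1))
print("array byte size fits:", fits_in_struct_i(array_nbytes))
```

With 10000 rows the same array is about 80 MB, well inside the range, which is consistent with the small test samples running without error.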

ogrisel's answer does work, using manual memory mapping with scikit-learn 0.15.0b1. Don't forget to run only one script at a time; otherwise you can still run out of memory and end up with too many threads. (My run takes ~60 GB for ~12.5 GB of CSV data, with 8 threads.)

Recommended answer

As a workaround, you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

Edit #1: Here is the important part:

from sklearn.externals import joblib

joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmapped data to GridSearchCV under scikit-learn 0.15+.
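
For readers who want something they can run end to end, here is a minimal sketch of the same dump-then-load step on a small stand-in array. It assumes the standalone joblib package is installed (the answer imports it from sklearn.externals, which matches scikit-learn 0.15). The helper name dump_and_memmap, the file name and the array sizes are all hypothetical.

```python
import os
import tempfile

import numpy as np
import joblib  # assumption: standalone joblib, not sklearn.externals.joblib


def dump_and_memmap(array, directory, name="X_train.joblib"):
    """Dump `array` to disk, then reload it as a disk-backed numpy.memmap."""
    filename = os.path.join(directory, name)  # hypothetical file name
    # An uncompressed dump is what allows the array to be memory mapped.
    joblib.dump(array, filename)
    # 'r+' mirrors the mode used in the answer.
    return joblib.load(filename, mmap_mode="r+")


# Small stand-in for the real training data (sizes are illustrative only).
tmp_dir = tempfile.mkdtemp()
X_small = np.random.randn(100, 10)
X_mm = dump_and_memmap(X_small, tmp_dir)

print(type(X_mm).__name__, X_mm.shape)
```

The returned object can then be used wherever X_train was used, for example in the gs.fit(X_train, y_train) call of the test script.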

Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process is limited to 2 GB, which can also limit the memory available.

I just found a bug in numpy.save under Python 3.4, but even when it is fixed, the subsequent call to mmap fails with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, since AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).
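
A quick way to check which kind of interpreter is running is sketched below, using only the standard library. The function names are hypothetical. Both checks describe the Python build itself, not the operating system.

```python
import struct
import sys


def python_pointer_bits():
    """Width of a C pointer in this interpreter build, in bits."""
    return struct.calcsize("P") * 8


def is_64bit_python():
    """True when the interpreter can address more than 4 GiB."""
    return sys.maxsize > 2**32


print("pointer width:", python_pointer_bits(), "bits")
print("64-bit Python:", is_64bit_python())
```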

Edit #3: I found another issue that might be causing excessive memory usage under Windows: joblib.Parallel currently memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
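
To make the three modes concrete, the sketch below shows how they differ when a small array is written to. It uses NumPy's own np.save and np.load rather than joblib, purely to keep the example dependency-light. It does not reproduce the Windows paging-file problem, and the function name and file name are hypothetical.

```python
import os
import tempfile

import numpy as np


def demo_mmap_modes():
    """Write through memmaps opened in modes 'c', 'r' and 'r+'."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "data.npy")  # hypothetical file name
    np.save(path, np.zeros(5))

    # 'c' (copy-on-write): the write is visible in memory but never
    # reaches the file on disk.
    cow = np.load(path, mmap_mode="c")
    cow[0] = 1.0
    value_seen_in_cow = float(cow[0])
    del cow
    on_disk_after_cow = float(np.load(path)[0])

    # 'r' (read-only): any attempt to write is rejected.
    read_only = np.load(path, mmap_mode="r")
    try:
        read_only[0] = 2.0
        read_only_rejected = False
    except ValueError:
        read_only_rejected = True
    del read_only

    # 'r+' (read-write): the write goes through to the file.
    read_write = np.load(path, mmap_mode="r+")
    read_write[0] = 3.0
    read_write.flush()
    del read_write
    on_disk_after_rw = float(np.load(path)[0])

    return (value_seen_in_cow, on_disk_after_cow,
            read_only_rejected, on_disk_after_rw)


print(demo_mmap_modes())
```

Since 'c' has to keep private copies of any modified pages, it is plausible that it puts more pressure on the paging file than 'r' or 'r+', which is in line with the observation above.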
