Similar errors in multiprocessing: mismatched number of arguments to a function


Problem description

I couldn't find a better way to describe the error I'm facing, but it seems to come up every time I try to add multiprocessing around a loop call.

I've used both sklearn.externals.joblib and multiprocessing.Process; the errors are similar, though not identical.

Here is the original loop to be parallelized, where each iteration runs in a single thread/process:

for dd in final_col_dates:
    idx1 = final_col_dates.tolist().index(dd)

    dataObj = GetPrevDataByDate(d1, a, dd, self.start_hour_of_day)
    data2 = dataObj.fit()

    dataObj = GetAppointmentControlsSchedule(data2, idx1, d, final_col_dates_mod, dd, self.DC, frgt_typ_filter)
    data3 = dataObj.fit()

    if idx1 > 0:
        data3['APPT_SCHD_ARVL_D_{}'.format(idx1)] = np.nan

    iter += 1

    days_out_vars.append(data3)

To parallelize the snippet above, I moved the loop body (everything except the for statement) into a method.

Using joblib, the following is my code snippet:

Parallel(n_jobs=2)(
    delayed(self.ParallelLoopTest)(dd, final_col_dates, d1, a, d, final_col_dates_mod, iter, return_list)
    for dd in final_col_dates)

The variable return_list is a shared list that ParallelLoopTest appends to. It is declared as:

manager = Manager()
return_list = manager.list()
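
For context, here is a minimal, self-contained sketch of how such a manager-backed list is shared across processes (the collect worker below is hypothetical, not from my project):

from multiprocessing import Manager, Process

def collect(shared_results, value):
    # Each worker appends into the manager-backed proxy list.
    shared_results.append(value * 2)

if __name__ == '__main__':
    manager = Manager()
    return_list = manager.list()  # proxy list usable from any process
    jobs = [Process(target=collect, args=(return_list, i)) for i in range(4)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
    print(list(return_list))  # e.g. [0, 2, 4, 6]; order may vary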

With the snippet above, I get the following error:

Process SpawnPoolWorker-3:
Traceback (most recent call last):
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\process.py", line 249, in _bootstrap
  self.run()
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\process.py", line 93, in run
  self._target(*self._args, **self._kwargs)
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
  task = get()
File "C:\Users\dkanhar\Anaconda3\lib\site-packages\sklearn\externals\joblib\pool.py", line 359, in get
  return recv()
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
  return ForkingPickler.loads(buf.getbuffer())
TypeError: function takes at most 0 arguments (1 given)

I also tried the multiprocessing module to run the same code and hit a similar error. The following code uses multiprocessing directly:

for dd in final_col_dates:
    # multiprocessing.Pipe(False)
    p = multiprocessing.Process(target=self.ParallelLoopTest, args=(dd, final_col_dates, d1, a, d, final_col_dates_mod, iter, return_list))
    jobs.append(p)
    p.start()

for proc in jobs:
    proc.join()

And I get the following error traceback:

File "<string>", line 1, in <module>
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\spawn.py", line 106, in spawn_main
   exitcode = _main(fd)
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\spawn.py", line 116, in _main
   self = pickle.load(from_parent)
TypeError: function takes at most 0 arguments (1 given)
Traceback (most recent call last):
File "E:/Projects/Predictive Inbound Cartoon Estimation-MLO/Python/dataprep/DataPrep.py", line 457, in <module>
   print(obj.fit())
File "E:/Projects/Predictive Inbound Cartoon Estimation-MLO/Python/dataprep/DataPrep.py", line 39, in fit
return self.__driver__()
File "E:/Projects/Predictive Inbound Cartoon Estimation-MLO/Python/dataprep/DataPrep.py", line 52, in __driver__
   final = self.process_()
File "E:/Projects/Predictive Inbound Cartoon Estimation-MLO/Python/dataprep/DataPrep.py", line 135, in process_
   sch_dat = self.inline_apply_(all_dates_schd, d1, d2, a)
File "E:/Projects/Predictive Inbound Cartoon Estimation-MLO/Python/dataprep/DataPrep.py", line 297, in inline_apply_
   p.start()
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\process.py", line 105, in start
   self._popen = self._Popen(self)
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\context.py", line 212, in _Popen
   return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\context.py", line 313, in _Popen
   return Popen(process_obj)
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
   reduction.dump(process_obj, to_child)
File "C:\Users\dkanhar\Anaconda3\lib\multiprocessing\reduction.py", line 59, in dump
   ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

So I tried uncommenting the multiprocessing.Pipe(False) line, thinking the Pipe might be the cause, but the problem persists and I get the same error.

In case it helps, here is my ParallelLoopTest method:

def ParallelLoopTest(self, dd, final_col_dates, d1, a, d, final_col_dates_mod, iter, days_out_vars):
    idx1 = final_col_dates.tolist().index(dd)

    dataObj = GetPrevDataByDate(d1, a, dd, self.start_hour_of_day)
    data2 = dataObj.fit()

    dataObj = GetAppointmentControlsSchedule(data2, idx1, d, final_col_dates_mod, dd, self.DC, frgt_typ_filter)
    data3 = dataObj.fit()

    if idx1 > 0:
        data3['APPT_SCHD_ARVL_D_{}'.format(idx1)] = np.nan

    print("Iter ", iter)
    iter += 1

    days_out_vars.append(data3)

The reason I call the errors similar is that both tracebacks share the same line in the middle:

TypeError: function takes at most 0 arguments (1 given), raised while loading from pickle, and I don't know why it happens.

Also note that I've used both of these modules successfully in earlier projects without ever hitting an issue, so I don't know why this problem started now or what it actually means.

Any help would be really appreciated; I've been debugging this for three days.

Thanks

Edit 1 (after the answer below)

After the answer, I tried the following: added the @staticmethod decorator, removed self, and called the method as DataPrep.ParallelLoopTest(args).

I also moved the method out of the DataPrep class and called it simply as ParallelLoopTest(args),

but in both cases the error stays the same.

PS: I tried joblib in both cases, so neither solution worked.

New method definition:

def ParallelLoopTest(dd, final_col_dates, d1, a, d, final_col_dates_mod, iter, days_out_vars, DC, start_hour):
    idx1 = final_col_dates.tolist().index(dd)

    dataObj = GetPrevDataByDate(d1, a, dd, start_hour)
    data2 = dataObj.fit()

    dataObj = GetAppointmentControlsSchedule(data2, idx1, d, final_col_dates_mod, dd, DC, frgt_typ_filter)
    data3 = dataObj.fit()

    if idx1 > 0:
        data3['APPT_SCHD_ARVL_D_{}'.format(idx1)] = np.nan

    print("Iter ", iter)
    iter += 1

    days_out_vars.append(data3)

I was facing the error because Python was unable to pickle some large DataFrames. Two of my arguments were DataFrames, one around 20 MB and the other around 200 MB in pickle format. But that shouldn't be an issue, right? We should be able to pass a pandas DataFrame; correct me if I'm wrong.
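
A quick sanity check (a minimal sketch, nothing project-specific) suggests a bare DataFrame does round-trip through pickle on its own, so the failure presumably involves something else that travels with the task:

import pickle

import pandas as pd

# A plain DataFrame pickles and unpickles fine in isolation.
df = pd.DataFrame({'a': range(5)})
clone = pickle.loads(pickle.dumps(df))
assert clone.equals(df)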

Also, my workaround was to save each DataFrame as a CSV with a random name before the method call, pass the file name, and read the CSV back inside the method; but that is a slow process, since it involves reading huge CSV files. Any suggestions?

Answer

You actually get the exact same error in both cases, but because one example uses a Pool (joblib) and the other a Process, the failure/traceback in your main thread differs: they don't manage a process failure the same way.
In both cases, the new process fails to unpickle your child job. The Pool hands the unpickling error back to you, whereas with Process the subprocess dies from the unpickling error and closes the pipe the main thread is writing to, causing the BrokenPipeError in the main process.

My first idea is that the error is caused by trying to pickle an instance method, whereas you should use a static method here (an instance method does not seem right anyway, as the object is not shared between processes).
Add the @staticmethod decorator before the declaration of ParallelLoopTest and remove the self argument.
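
A minimal sketch of the suggested shape (the body is a placeholder, not the real ParallelLoopTest):

from multiprocessing import Pool

class DataPrep:
    @staticmethod
    def ParallelLoopTest(dd):
        # No `self` here: only the plain arguments are pickled and
        # shipped to the worker process.
        return dd * 2

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        # Reference the method through the class, not an instance.
        print(pool.map(DataPrep.ParallelLoopTest, [1, 2, 3]))  # [2, 4, 6]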

Another possibility is that one of the arguments dd, final_col_dates, d1, a, d, final_col_dates_mod, iter, return_list cannot be unpickled; apparently it comes from a pandas.DataFrame.
I do not see any reason why the unpickling would fail in this case, but I don't know pandas that well.
One workaround is to dump the data to a temporary file; you can look at this link for efficient serialization of a pandas.DataFrame. Another solution is to use the DataFrame.to_pickle method and pandas.read_pickle to dump it to and retrieve it from a file.
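
A hedged sketch of that second workaround, writing the frame once in the parent and shipping only the file path to the worker (dump_frame and worker are illustrative names, not your code):

import os
import tempfile

import pandas as pd

def dump_frame(df):
    # Write the frame once in the parent; a binary pickle reloads much
    # faster than CSV and preserves dtypes.
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)
    df.to_pickle(path)
    return path

def worker(path):
    # Each worker reloads the frame from disk instead of receiving it
    # through the multiprocessing pipe.
    df = pd.read_pickle(path)
    return len(df)

if __name__ == '__main__':
    path = dump_frame(pd.DataFrame({'a': range(1000)}))
    print(worker(path))  # 1000
    os.remove(path)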

Note that the fairer comparison is joblib.Parallel with multiprocessing.Pool, not with multiprocessing.Process.
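
For reference, a minimal sketch of that like-for-like pair; both dispatch tasks to a pool of workers, so pickling problems surface in the same place (imported here as the standalone joblib package, which sklearn.externals.joblib re-exports):

from multiprocessing import Pool

from joblib import Parallel, delayed

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
    print(Parallel(n_jobs=2)(delayed(square)(x) for x in range(5)))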

