Why am I getting [Errno 7] Argument list too long and OSError: [Errno 24] Too many open files when using mrjob v0.4.4?


Problem description


It seems like the nature of the MapReduce framework is to work with many files. So when I get errors that tell me I'm using too many files, I suspect I'm doing something wrong.


If I run the job with the inline runner and three directories, it works:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/


But if I run it using the local runner (and the same three directories), it fails:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/0[1-3]/*.log -r local --no-output --output-dir city1_results/gps_quality/2015/03/

[...output clipped...]

> /Users/andrewsturges/sturges/mr/env/bin/python mr_gps_quality.py --step-num=0 --mapper /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/input_part-00249 > /var/folders/32/5vqk9bjx4c773cpq4pn_r80c0000gn/T/mr_gps_quality.andrewsturges.20150604.170016.046323/step-k0-mapper_part-00249
Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 182, in _run
    self._invoke_step(step_num, 'mapper')
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 269, in _invoke_step
    working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 150, in _run_step
    procs_args, output_path, working_dir, env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 253, in _invoke_processes
    cwd=working_dir, env=env)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/local.py", line 76, in _chain_procs
    proc = Popen(args, **proc_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1197, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files


Furthermore, if I go back to using the inline runner and include even more directories (11 total) in my input, then I get a different error again:

$ python mr_gps_quality.py  /Volumes/Logs/gps/ByCityLogs/city1/*/*.log -r inline --no-output --output-dir city1_results/gps_quality/2015/03/

[...clipped...]

Traceback (most recent call last):
  File "mr_gps_quality.py", line 53, in <module>
    MRGPSQuality.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 494, in run 
    mr_job.execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run 
    self._run()
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "/Users/andrewsturges/sturges/mr/env/lib/python2.7/site-packages/mrjob/runner.py", line 1202, in _invoke_sort
    check_call(args, stdout=output, stderr=err, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 537, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long


The mrjob docs include a discussion of the differences between the inline and local runners, but I don't understand how it would explain this behavior.


Lastly, I'll mention that the number of files in the directories I'm globbing isn't huge:

$ find . -maxdepth 1 -mindepth 1 -type d | while read dir; do   printf "%-25.25s : " "$dir";   find "$dir" -type f | wc -l; done | sort
./01                      :      236
./02                      :      169
./03                      :      176
./04                      :      185
./05                      :      176
./06                      :      235
./07                      :      275
./08                      :      265
./09                      :      186
./10                      :      171
./11                      :      161
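The per-directory counts above sum to roughly 2,200 files; a quick way to confirm the total (a sketch, assuming the same working directory as the listing above):

```shell
# Count every file two levels down, i.e. the sum of the per-directory
# counts shown in the listing above.
find . -mindepth 2 -type f | wc -l
```

That total is well above the default per-process open-file limit on OS X, which is often as low as 256 (`ulimit -n`).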


I don't think this has to do with the job itself, but here it is:

from mrjob.job import MRJob
import numpy as np
import geohash

class MRGPSQuality(MRJob):

    def mapper(self, _, line):

        try:
            # Split the line once and index into the fields, rather than
            # re-splitting it for every column.
            fields = line.split(',')
            lat = float(fields[1])
            lng = float(fields[2])
            horizontalAccuracy = float(fields[4])
            gh = geohash.encode(lat, lng, precision=7)
            yield gh, horizontalAccuracy
        except (IndexError, ValueError):
            # Skip malformed lines instead of swallowing every exception.
            pass

    def reducer(self, key, values):
        # Convert the generator straight back to array:
        vals = np.fromiter(values, float)
        count = len(vals)
        mean = np.mean(vals)
        if count > 50:
            yield key, [count, mean]

if __name__ == '__main__':
    MRGPSQuality.run()

Answer


The "Argument list too long" problem comes not from the job or from Python, but from bash. The asterisks in the command line that kicks off the job expand to every matching file, producing a command line long enough to exceed the shell's argument-length limit.


That error has nothing to do with ulimit. The "Too many open files" error, however, is a ulimit issue: if the command actually ran, you would bump into the open-file limit.


You can check the shell's argument-length limit like this (if you are interested):

getconf ARG_MAX
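Both limits can be inspected from the shell, and the open-file limit raised for the current session; the value 4096 below is just an illustrative choice:

```shell
# Kernel limit on the combined size of argv + environment (bytes);
# exceeding it yields "Argument list too long" (Errno 7).
getconf ARG_MAX

# Per-process open-file limit; exhausting it yields
# "Too many open files" (Errno 24).
ulimit -n

# Raise the soft open-file limit for this shell session (4096 is an
# arbitrary example value, capped by the hard limit shown by ulimit -Hn).
ulimit -n 4096
```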


To get around the max-args problem, you can concatenate all the files into one like this:

for f in *; do cat "$f" >> ../directory/bigfile.log; done


Then run your mrjob pointed at the big file.


If there are a lot of files, you can use GNU parallel to concatenate them with multiple threads, since the command above is single-threaded and slow:

ls | parallel -m -j 8 "cat {} >> ../files/bigfile.log"


*Change 8 to the degree of parallelism you want.
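An alternative that avoids expanding the glob on the command line at all is to let find generate the file list and xargs batch it into argument lists that fit within ARG_MAX (the paths below mirror the ones in the question and are illustrative):

```shell
# find emits the matching paths itself, so the shell never builds a huge
# argv; xargs -0 splits them into appropriately sized cat invocations.
find /Volumes/Logs/gps/ByCityLogs/city1 -name '*.log' -print0 \
  | xargs -0 cat >> ../files/bigfile.log
```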
