Efficient file reading in python with need to split on '\n'

Problem Description

I've traditionally been reading in files with:

file = open(fullpath, "r")
allrecords = file.read()
delimited = allrecords.split('\n')
for record in delimited[1:]:
    record_split = record.split(',')

with open(os.path.join(txtdatapath,pathfilename), "r") as data:
  datalines = (line.rstrip('\r\n') for line in data)
  for record in datalines:
    split_line = record.split(',')
    if len(split_line) > 1:

But it seems that when I process these files in a multiprocessing thread I get a MemoryError. How can I best read in files line by line, when the text file I'm reading needs to be split on '\n'?
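A minimal line-by-line sketch, assuming the same fullpath as above and that the first line is a header (as delimited[1:] suggests): iterating the file object yields one line at a time, so the whole file never has to sit in memory.

with open(fullpath, "r") as f:
    next(f)  # skip the header line, as delimited[1:] did above
    for line in f:
        record_split = line.rstrip('\r\n').split(',')
        # process record_split here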

Here is the multiprocessing code:

from multiprocessing import Pool
import time

pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)     
while not op_list.ready():
  print("Number of files left to process: {}".format(op_list._number_left))
  time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()

Here is the error log:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError

I'm trying to install pathos, as Mike has kindly suggested, but I'm running into issues. Here is my install command:

pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre

But here are the error messages that I get:

Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master

Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft

    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow).
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)

Storing debug log for failure in C:\Users\xxx\pip\pip.log

I'm installing on Windows 7 64 bit. In the end I managed to install with easy_install.

But now I have a failure, as I cannot open that many files:

Finished reading in Exposures...
Reading Samples from:  C:\XXX\XXX\XXX\
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multipr
ocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\\xx.csv'
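One possible way around the handle limit (a sketch assuming readinfiles is the list of paths from the question, not the approach from the answer below): pass filenames rather than already-open file objects, so each worker opens and closes only the file it is working on.

from pathos.multiprocessing import ProcessingPool

def read_lines(name):
    # opened and closed inside the worker, so only a few files
    # are open at any one time
    with open(name, 'r') as f:
        return [line.rstrip('\r\n') for line in f]

pool = ProcessingPool()
data = pool.map(read_lines, readinfiles)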

Currently, using the multiprocessing library, I pass parameters and dictionaries into my function, open the mapped file, and then output a dictionary. Here is an example of how I currently do it; what would be the smart way to do this with pathos?

def PP_star(args_flat):
    return PP(*args_flat)

def PP(pathfilename, txtdatapath, my_dict):
    # ... open the file and build com_dict (body elided in the question) ...
    return com_dict

fixed_args = (targetdirectorytxt, my_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PP_star, list(varg), chunksize=1)

How do I use pathos.multiprocessing?

Recommended Answer

Let's say we have file1.txt:

hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123

file2.txt:

1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye

and so on, through file5.txt:

1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye

I'd suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.

>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>> 
>>> def rstrip(line):
...     return line.rstrip()
... 
>>> # get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']

However, if you want to check how many files you have left to finish, you might want to use an "iterated" map (imap) instead of an "asynchronous" map (amap). See this post for details: Python multiprocessing - tracking the process of pool.map operation
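A rough sketch of that variant, assuming the thpool, mppool, rstrip and fnames objects from the session above, and that imap accepts the same arguments as amap:

files = (open(name, 'r') for name in fnames)
done = 0
for lines in thpool.imap(mppool.map, [rstrip]*len(fnames), files):
    done += 1
    print("Finished {} of {} files".format(done, len(fnames)))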

Get pathos here: https://github.com/uqfoundation
