Efficient file reading in Python with need to split on '\n'
Question
I've traditionally been reading in files with:
file = open(fullpath, "r")
allrecords = file.read()
delimited = allrecords.split('\n')
for record in delimited[1:]:
    record_split = record.split(',')
and
with open(os.path.join(txtdatapath, pathfilename), "r") as data:
    datalines = (line.rstrip('\r\n') for line in data)
    for record in datalines:
        split_line = record.split(',')
        if len(split_line) > 1:
But it seems that when I process these files in a multiprocessing pool I get a MemoryError. How can I best read in files line by line, when the text file I'm reading needs to be split on '\n'?
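For reference, a file object is itself an iterator over lines, so one can avoid holding the whole file in memory at once. A minimal sketch (using an in-memory buffer with made-up data in place of a real file):

```python
import io

# Stand-in for open(fullpath, "r") -- hypothetical CSV contents.
buf = io.StringIO("id,value\n1,a\n2,b\n")

# Iterating the file object yields one line at a time; the whole
# file is never loaded into memory, unlike file.read().
rows = []
next(buf)  # skip the header row, like delimited[1:] above
for line in buf:
    rows.append(line.rstrip('\r\n').split(','))

print(rows)  # [['1', 'a'], ['2', 'b']]
```

This keeps peak memory proportional to the longest line rather than the whole file.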
Here is the multiprocessing code:
pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
while not op_list.ready():
    print("Number of files left to process: {}".format(op_list._number_left))
    time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()
Here is the error log:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError
I'm trying to install pathos as Mike has kindly suggested, but I'm running into issues. Here is my install command:
pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre
But here are the error messages that I get:
Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master
Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft
    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow).
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
Storing debug log for failure in C:\Users\xxx\pip\pip.log
I'm installing on Windows 7 64-bit. In the end I managed to install with easy_install.
But now I get a failure, as I cannot open that many files:
Finished reading in Exposures...
Reading Samples from: C:\XXX\XXX\XXX\
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multiprocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\\xx.csv'
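One common way around the open-file limit is to pass filenames to the workers and let each worker open, read, and close its own file, so only as many files are open as there are concurrent workers. A sketch (not the poster's actual code: `parse_file` and the temp files are made up for illustration, and a thread pool stands in for a process pool):

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool

def parse_file(path):
    # Each worker opens its own file; the 'with' block closes it
    # immediately, so open handles never accumulate.
    with open(path, 'r') as f:
        return [line.rstrip('\r\n').split(',') for line in f]

# Create a few small temp files to stand in for the real CSVs.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, 'file%d.csv' % i)
    with open(p, 'w') as f:
        f.write('a,%d\nb,%d\n' % (i, i))
    paths.append(p)

# Map over filenames, not over already-open file objects.
pool = ThreadPool(2)
results = pool.map(parse_file, paths)
pool.close()
pool.join()
print(results[0])  # [['a', '0'], ['b', '0']]
```

The key difference from the failing code above is that nothing builds a generator of thousands of open file handles.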
Currently, using the multiprocessing library, I am passing parameters and dictionaries into my function, opening a mapped file, and then outputting a dictionary. Here is an example of how I currently do it; what would be the smart way to do this with pathos?
def PP_star(args_flat):
    return PP(*args_flat)

def PP(pathfilename, txtdatapath, my_dict):
    return com_dict

fixed_args = (targetdirectorytxt, my_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PP_star, list(varg), chunksize=1)
How can I do this with pathos.multiprocessing?
Answer
Let's say we have file1.txt:
hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123
file2.txt:
1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye
and so on, through file5.txt:
1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye
I'd suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.
>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>>
>>> def rstrip(line):
... return line.rstrip()
...
# get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']
However, if you want to check how many files you have left to finish, you might want to use an "iterated" map (imap) instead of an "asynchronous" map (amap). See this post for details: Python multiprocessing - tracking the progress of pool.map operation
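To illustrate the idea, the same progress tracking can be sketched with the standard library's pool, whose imap yields each result as it completes (a thread pool is used here so the made-up worker `count_lines` needn't be picklable):

```python
import io
from multiprocessing.pool import ThreadPool

# In-memory stand-ins for the real files (hypothetical contents).
files = [io.StringIO('x\ny\n'), io.StringIO('z\n')]

def count_lines(f):
    return len(f.read().splitlines())

pool = ThreadPool(2)
done = 0
counts = []
# imap yields results one at a time, in order, as workers finish,
# so progress can be reported after each file.
for result in pool.imap(count_lines, files):
    counts.append(result)
    done += 1
    print("Files processed: %d/%d" % (done, len(files)))
pool.close()
pool.join()
```

With amap, by contrast, you only get the whole result list at the end (or poll `_number_left`, a private attribute, as the question's code does).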
Get pathos here: https://github.com/uqfoundation