What is the fastest way in Python to convert a string of formatted numbers into a NumPy array?
Question
I have a large ASCII file (~100 GB) consisting of roughly 1,000,000 lines of numbers in a known format, which I am trying to process with Python. The file is too large to read completely into memory, so I decided to process it line by line:
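import numpy as np

fp = open(file_name)
for count, line in enumerate(fp):
    # np.float was only an alias for the builtin float and has been removed from NumPy
    data = np.array(line.split(), dtype=float)
    # do stuff
fp.close()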
It turns out that most of my program's run time is spent in the data = np.array(...) line. Is there any way to speed up that line? Also, the execution speed seems much slower than what I could get from a native FORTRAN program with a formatted read (see this question). I've implemented a FORTRAN string processor and used it with f2py, but the run time was only comparable with the data = np.array(...) line. I guess the I/O handling and type conversions between Python and FORTRAN killed what I gained from FORTRAN.
Since I know the formatting, shouldn't there be a better and faster way than using split()? Something like:
data = readf(line,'(1000F20.10)')
I tried the fortranformat package, which worked well, but in my case it was three times slower than the split() approach.
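To make the fixed-width idea concrete, here is a minimal sketch of what a readf-like helper could look like for a format such as (1000F20.10), where every field is exactly 20 characters wide. The name parse_fixed_width and its width parameter are hypothetical and not part of NumPy or fortranformat. The sketch assumes the line length, excluding the newline, is an exact multiple of the field width. It is not claimed to be faster than split(); that would have to be measured on the real data.

```python
import numpy as np


def parse_fixed_width(line, width=20):
    """Parse a line of fixed-width numeric fields into a float array.

    Hypothetical helper: assumes every field is exactly `width`
    characters wide, as in a Fortran F20.10 edit descriptor.
    """
    text = line.rstrip("\n")
    # Slice by position instead of searching for whitespace.
    fields = [text[i:i + width] for i in range(0, len(text), width)]
    # float() ignores the padding spaces around each number.
    return np.array([float(field) for field in fields], dtype=float)


# Build a small example line in F20.10 layout and parse it back.
example = "".join("%20.10f" % v for v in (1.5, -2.25, 3.0)) + "\n"
print(parse_fixed_width(example))
```

Slicing by position only pays off if it avoids work that split() would otherwise do, so it is worth timing both on a representative line before switching.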
P.S. As suggested by ExP and root, I tried np.fromstring and made this quick and dirty benchmark:
import time

t1 = time.time()
for i in range(500):
    data = np.array(line.split(), dtype=float)
t2 = time.time()
print((t2 - t1) / 500)  # average seconds per conversion
print(data.shape)
print(data[0])
0.00160977363586
(9002,)
0.0015162509
and:
t1 = time.time()
for i in range(500):
    data = np.fromstring(line, sep=' ', dtype=float, count=9002)
t2 = time.time()
print((t2 - t1) / 500)  # average seconds per conversion
print(data.shape)
print(data[0])
0.00159792804718
(9002,)
0.0015162509
So fromstring makes essentially no difference in my case: the two timings differ by less than 1%.
Recommended answer
Have you tried numpy.fromstring?
np.fromstring(line, dtype=float, sep=" ")
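Whether fromstring actually helps depends on the data, so it is worth measuring on a representative line. The following is a minimal, self-contained sketch of such a comparison using timeit. The synthetic line of 9002 random values and the helper names with_split and with_fromstring are assumptions for illustration only; substitute a real line from the file to get meaningful numbers.

```python
import timeit

import numpy as np

# Synthetic stand-in for one line of the file: 9002 space-separated floats.
rng = np.random.default_rng(0)
line = " ".join("%.10f" % v for v in rng.random(9002))


def with_split(text):
    # Split into Python strings first, then let NumPy convert them.
    return np.array(text.split(), dtype=float)


def with_fromstring(text):
    # Let NumPy parse the whitespace-separated text directly.
    return np.fromstring(text, dtype=float, sep=" ")


# Both approaches should produce the same values.
assert np.allclose(with_split(line), with_fromstring(line))

for func in (with_split, with_fromstring):
    # Take the best of several repeats to reduce timing noise.
    best = min(timeit.repeat(lambda: func(line), number=50, repeat=5)) / 50
    print(f"{func.__name__}: {best:.6f} s per line")
```

Taking the minimum over several repeats is less sensitive to background load than a single time.time() difference, which matters when the two candidates are this close.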