What is the fastest way in Python to convert a string of formatted numbers into a NumPy array?

Views: 59 | Posted: 2021/6/15 19:16:07 | Tags: python, performance, numpy


Problem description

I have a large ASCII file (~100 GB) consisting of roughly 1,000,000 lines of numbers in a known format, which I am trying to process with Python. The file is too large to read into memory completely, so I decided to process it line by line:

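import numpy as np

# Stream the file line by line so it never has to fit in memory.
with open(file_name) as fp:
    for count, line in enumerate(fp):
        # np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype.
        data = np.array(line.split(), dtype=np.float64)
        # do stuff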

It turns out that I spend most of my program's run time on the data = line. Is there any way to speed up that line? Also, the execution speed seems much slower than what I could get from a native FORTRAN program with formatted read (see this question). I've implemented a FORTRAN string processor and used it with f2py, but the run time was only comparable to that of the data = line. I guess the I/O handling and type conversions between Python and FORTRAN killed what I gained from FORTRAN.

Since I know the formatting, shouldn't there be a better and faster way than using split()? Something like:

data = readf(line,'(1000F20.10)')

I tried the fortranformat package, which worked well, but in my case it was three times slower than the split() approach.
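For illustration, one way to exploit a known fixed-width layout without split() is to view the raw bytes of the line as equal-width string fields and let NumPy cast them to floats. This is only a sketch: the helper name parse_fixed_width, the default width of 20 (taken from the F20.10 descriptor above), and the sample values are assumptions, and it has not been timed against split(), so any speed difference would need to be measured on the real data.

```python
import numpy as np


def parse_fixed_width(line, width=20):
    """Convert a line of equal-width ASCII number fields to a float64 array.

    Hypothetical helper: assumes every field occupies exactly `width`
    characters, as an F20.10 Fortran edit descriptor would produce.
    """
    raw = line.rstrip("\r\n").encode("ascii")
    n_fields = len(raw) // width
    # View the bytes as n_fields fixed-length byte strings (no copy, no split).
    fields = np.frombuffer(raw, dtype=f"S{width}", count=n_fields)
    # Casting byte strings to float64 parses each field; padding spaces are tolerated.
    return fields.astype(np.float64)


# Build a small sample line in F20.10 layout (made-up values).
values = [0.0015162509, -12.5, 3.0]
line = "".join(f"{v:20.10f}" for v in values) + "\n"
data = parse_fixed_width(line)
print(data.shape, data[0])
```

Unlike split(), this approach does not depend on whitespace between fields, so it still works when a wide number fills its whole field and touches its neighbour.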

P.S. As suggested by ExP and root, I tried np.fromstring and made this quick-and-dirty benchmark:

import time
import numpy as np

t1 = time.time()
for i in range(500):
    data = np.array(line.split(), dtype=np.float64)
t2 = time.time()
print((t2 - t1) / 500)  # mean seconds per conversion
print(data.shape)
print(data[0])

Output:
0.00160977363586
(9002,)
0.0015162509

and:

import time
import numpy as np

t1 = time.time()
for i in range(500):
    data = np.fromstring(line, sep=' ', dtype=np.float64, count=9002)
t2 = time.time()
print((t2 - t1) / 500)  # mean seconds per conversion
print(data.shape)
print(data[0])

Output:
0.00159792804718
(9002,)
0.0015162509

So in my case fromstring brings no real speedup: the two timings differ by less than 1%.
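A difference this small is within the noise of a single time.time() measurement, so a comparison based on the timeit module may be more reliable. The following is a sketch only: the synthetic line of 9002 random numbers and the function names with_split and with_fromstring are assumptions made for illustration, and the timings it prints will depend on the machine and NumPy version.

```python
import timeit

import numpy as np

# Synthetic stand-in for one line of the file: 9002 space-separated numbers.
rng = np.random.default_rng(0)
line = " ".join(f"{x:.10f}" for x in rng.random(9002)) + "\n"


def with_split():
    return np.array(line.split(), dtype=np.float64)


def with_fromstring():
    return np.fromstring(line, sep=" ", dtype=np.float64, count=9002)


for func in (with_split, with_fromstring):
    # Take the best of several repeats to reduce the influence of other processes.
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    print(f"{func.__name__}: {best:.6f} s per call")
```

Taking the minimum over several repeats, rather than a single mean, makes it easier to tell a real difference from scheduling noise.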
