Python function to read variable length blocks of data from file while open

Views: 112 | Posted: 2018/4/17 18:30:57 | Tags: python function numpy file-processing

Problem description

I have data files that contain data for many timesteps, with each timestep formatted in a block like this:

TIMESTEP  PARTICLES
0.00500103 1262
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....

Each block consists of 3 header lines followed by the data lines for that timestep; the number of data lines is the integer on line 2. That number can vary from 0 to 10 million. Blocks may be separated by a blank line, but sometimes the blank line is missing.
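
To make the block layout concrete, here is a minimal sketch of parsing the three header lines. The names `BlockHeader` and `parse_block_header` are hypothetical and not part of the original post; the sketch assumes the first header line always starts with `TIMESTEP`, as in the sample above.

```python
from typing import List, NamedTuple


class BlockHeader(NamedTuple):
    timestep: float      # simulation time of this block (line 2, first field)
    num_particles: int   # number of data lines that follow (line 2, second field)
    columns: List[str]   # column names taken from line 3


def parse_block_header(title_line: str, values_line: str, columns_line: str) -> BlockHeader:
    """Parse the three header lines of one block (hypothetical helper)."""
    title_fields = title_line.split()
    if not title_fields or title_fields[0] != "TIMESTEP":
        raise ValueError(f"expected a TIMESTEP header, got {title_line!r}")
    timestep_text, count_text = values_line.split()
    return BlockHeader(float(timestep_text), int(count_text), columns_line.split())


header = parse_block_header(
    "TIMESTEP  PARTICLES",
    "0.00500103 1262",
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ",
)
print(header.num_particles, header.columns)
```

Calling `split()` with no argument splits on any run of whitespace, so the double spaces in the header lines do not produce empty fields, which `split(' ')` would.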

I want to be able to read the file block by block, processing the data after reading the block - the files are large (often over 200GB) and one timestep is about all that can be comfortably loaded into memory.

Because of the file format I thought it would be quite easy to write a function that reads the 3 header lines, reads the actual data, and then returns a nice numpy array for data processing. I'm used to MATLAB, where you can simply read in blocks while not at the end of the file. I'm not quite sure how to do this with Python.

I created the following function to read the block of data:
    import numpy as np

    def readBlock(f):
        particleData = []
        Timestep = []
        numParticles = []
        linesProcessed = 0
        line = f.readline().strip()
        if line.startswith('TIMESTEP'):
            # Three header lines: titles, values, column names
            timestepHeaders = line.strip()
            varData = f.readline().strip()
            headerStrings = f.readline().strip().split(' ')
            parts = varData.strip().split(' ')
            Timestep = float(parts[0])
            numParticles = int(parts[1])
            # Read exactly numParticles data lines
            while linesProcessed < numParticles:
                particleData.append(tuple(f.readline().strip().split(' ')))
                linesProcessed += 1
            mydt = np.dtype([
                ('ID', int),
                ('GROUP', int),
                ('Vol', float),
                ('Mass', float),
                ('Px', float),
                ('Py', float),
                ('Pz', float),
                ('Vx', float),
                ('Vy', float),
                ('Vz', float),
            ])
            particleData = np.array(particleData, dtype=mydt)
        return Timestep, numParticles, particleData
I try to run the function like this:
    import time

    with open(fileOpenPath, 'r') as file:
        # time.clock() was removed in Python 3.8; perf_counter() replaces it
        startWallTime = time.perf_counter()
        Timestep, numParticles, particleData = readBlock(file)
        print(Timestep)
        ## Do processing stuff here
        print("Timestep Processed")
        endWallTime = time.perf_counter()
The problem is this only reads the first block of data from the file and stops there - I don't know how to make it loop through the file until it hits the end and stops.

Any suggestions on how to make this work would be great. I think I could do it with line-by-line processing and lots of if checks to see whether I'm at the end of the timestep, but the simple function seemed easier and clearer.
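
For reference, the "read blocks while not at end of file" pattern can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `read_block` is a hypothetical function that returns `None` at end of file, it uses the standard library only, and it leaves each row as a tuple of strings rather than building a numpy array. It relies on `readline()` returning an empty string only at end of file.

```python
import io


def read_block(f):
    """Read one block from an open text file; return None at end of file."""
    line = f.readline()
    while line and not line.strip():   # skip blank lines between blocks
        line = f.readline()
    if not line:                       # readline() returns '' only at end of file
        return None
    timestep_text, count_text = f.readline().split()
    f.readline()                       # column-name line, not needed in this sketch
    count = int(count_text)
    rows = [tuple(f.readline().split()) for _ in range(count)]
    return float(timestep_text), rows


# Small in-memory stand-in for a data file (illustrative values only)
demo = io.StringIO(
    "TIMESTEP PARTICLES\n"
    "0.1 2\n"
    "ID GROUP\n"
    "1 0\n"
    "2 0\n"
    "\n"
    "TIMESTEP PARTICLES\n"
    "0.2 1\n"
    "ID GROUP\n"
    "3 0\n"
)

counts = []
while True:
    block = read_block(demo)
    if block is None:                  # end of file reached, stop looping
        break
    timestep, rows = block
    counts.append((timestep, len(rows)))
print(counts)
```

Only one block is held in memory at a time, because each call reads just the lines belonging to the next timestep.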
Solution
You can use the max_rows argument of numpy.genfromtxt:
    import numpy as np

    with open("timesteps.dat", "rb") as f:
        while True:
            line = f.readline()
            # Skip blank lines (stop at end of file, where readline() returns b"")
            while line and len(line.strip()) == 0:
                line = f.readline()
            if len(line) == 0:
                # End of file
                break
            line2_fields = f.readline().split()
            timestep = float(line2_fields[0])
            particles = int(line2_fields[1])
            # Reads the column-name line plus `particles` data rows
            data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)
            print("Timestep:", timestep)
            print("Particles:", particles)
            print("Data:")
            print(data)
            print()
Here's a sample file:
    TIMESTEP PARTICLES
    0.00500103 4
    ID GROUP VOLUME MASS PX PY PZ VX VY VZ
    651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
    430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
    384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
    971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
    TIMESTEP PARTICLES
    0.00500103 5
    ID GROUP VOLUME MASS PX PY PZ VX VY VZ
    971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
    652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
    431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
    385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
    972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

    TIMESTEP PARTICLES
    0.00500103 3
    ID GROUP VOLUME MASS PX PY PZ VX VY VZ
    222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
    333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
    444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
And here is the output:
    Timestep: 0.00500103
    Particles: 4
    Data:
    [ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
     (430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
     (384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
     (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

    Timestep: 0.00500103
    Particles: 5
    Data:
    [ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
     (652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
     (431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
     (385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
     (972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

    Timestep: 0.00500103
    Particles: 3
    Data:
    [ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
     (333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
     (444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
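
Since the question allows blocks with zero data lines, the same `max_rows` approach could be wrapped in a generator along these lines. This is a hedged sketch, not the answer's code: `iter_timesteps` is a hypothetical name, yielding `None` for an empty block is an assumption made here, and the zero-count branch exists because `max_rows` is expected to be at least 1, so `genfromtxt` is not called for such blocks. The in-memory `io.BytesIO` sample stands in for a file opened with `"rb"`, and its timestep values are illustrative.

```python
import io

import numpy as np


def iter_timesteps(f):
    """Yield (timestep, data) for each block of an open binary file."""
    while True:
        line = f.readline()
        while line and not line.strip():     # skip blank lines between blocks
            line = f.readline()
        if not line:                         # empty result means end of file
            return
        fields = f.readline().split()
        timestep, count = float(fields[0]), int(fields[1])
        if count == 0:
            f.readline()                     # consume the column-name line
            yield timestep, None             # assumption: None marks an empty block
            continue
        # names=True takes the field names from the column-name line
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=count)
        yield timestep, np.atleast_1d(data)  # keep the result 1-D for a single row


sample = io.BytesIO(
    b"TIMESTEP PARTICLES\n"
    b"0.1 2\n"
    b"ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    b"651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    b"430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    b"TIMESTEP PARTICLES\n"
    b"0.2 0\n"
    b"ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    b"\n"
    b"TIMESTEP PARTICLES\n"
    b"0.3 1\n"
    b"ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    b"222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
)

summary = [
    (timestep, None if data is None else data["ID"].tolist())
    for timestep, data in iter_timesteps(sample)
]
print(summary)
```

Because the generator yields one block at a time, the caller can process and discard each timestep before the next one is read, and columns can be selected by name (for example `data["PX"]`).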
