Python function to read variable length blocks of data from file while open


Problem description

I have data files that contain data for many timesteps, with each timestep formatted in a block like this:

TIMESTEP  PARTICLES
0.00500103 1262
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....

Each block consists of 3 header lines followed by the data lines for that timestep; the number of data lines is the integer on line 2. It can vary from 0 to 10 million. Blocks may be separated by a blank line, but sometimes this is missing.

I want to be able to read the file block by block, processing the data after reading the block - the files are large (often over 200GB) and one timestep is about all that can be comfortably loaded into memory.

Because of the file format I thought it would be quite easy to write a function that reads the 3 header lines, reads the actual data and then returns a nice numpy array for data processing. I'm used to MATLAB, where you can simply read in blocks while not at the end of the file. I'm not quite sure how to do this with Python.

I created the following function to read the block of data:

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0

    line = f.readline().strip()
    if line.startswith('TIMESTEP'): 

        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.strip().split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1

        mydt = np.dtype([ ('ID',int), 
                     ('GROUP', int),
                     ('Vol', float),
                     ('Mass', float),
                     ('Px', float),
                     ('Py', float),
                     ('Pz', float),
                     ('Vx', float),
                     ('Vy', float),
                     ('Vz', float),
                     ] )

        particleData = np.array(particleData, dtype=mydt)

    return Timestep, numParticles, particleData

I try to run the function like this:

with open(fileOpenPath, 'r') as file:
    startWallTime = time.clock()

    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)

    ## Do processing stuff here 
    print("Timestep Processed")

    endWallTime = time.clock()

The problem is this only reads the first block of data from the file and stops there - I don't know how to make it loop through the file until it hits the end and stops.

Any suggestions on how to make this work would be great. I think I can write a way of doing it using single line processing with lots of if checks to see if I'm at the end of the timestep, but the simple function seemed easier and clearer.

Solution

You can use the max_rows argument of numpy.genfromtxt:

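import numpy as np

with open("timesteps.dat", "rb") as f:
    while True:
        line = f.readline()
        # Skip blank lines; stop skipping at end of file, where
        # readline() returns an empty bytes object
        while line and not line.strip():
            line = f.readline()
        if not line:
            # End of file
            break
        # line now holds the "TIMESTEP PARTICLES" header; its values follow
        line2_fields = f.readline().split()
        timestep = float(line2_fields[0])
        particles = int(line2_fields[1])
        # names=True takes the field names from the column-header line;
        # max_rows stops reading at the end of this block
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)

        print("Timestep:", timestep)
        print("Particles:", particles)
        print("Data:")
        print(data)
        print()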
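
Since the question asked for a function, the same loop can be wrapped in a generator so that the caller gets one block at a time. The following is only a sketch under some assumptions: the name `iter_blocks` and the constant `PARTICLE_DTYPE` are made up for illustration, the field names are copied from the dtype in the question, and the column-header line is skipped by hand instead of using `names=True`. A block with zero particles is handled separately, because as far as I can tell `genfromtxt` does not accept `max_rows=0`.

```python
import io

import numpy as np

# Field names follow the dtype in the question; adjust them to your file.
PARTICLE_DTYPE = np.dtype([
    ("ID", int), ("GROUP", int), ("Vol", float), ("Mass", float),
    ("Px", float), ("Py", float), ("Pz", float),
    ("Vx", float), ("Vy", float), ("Vz", float),
])


def iter_blocks(f):
    """Yield (timestep, particles, data) for each block of an open file."""
    while True:
        line = f.readline()
        # Skip blank lines between blocks (and trailing ones at end of file).
        while line and not line.strip():
            line = f.readline()
        if not line:
            return  # end of file
        # `line` is the "TIMESTEP PARTICLES" header; the values come next.
        fields = f.readline().split()
        timestep, particles = float(fields[0]), int(fields[1])
        f.readline()  # column-name line; names come from PARTICLE_DTYPE
        if particles == 0:
            # Empty block: build the empty array directly instead of
            # calling genfromtxt with max_rows=0.
            data = np.empty(0, dtype=PARTICLE_DTYPE)
        else:
            data = np.genfromtxt(f, dtype=PARTICLE_DTYPE, max_rows=particles)
            data = np.atleast_1d(data)  # a single row comes back 0-d
        yield timestep, particles, data


# Small in-memory stand-in for the real file.
sample = io.StringIO(
    "TIMESTEP  PARTICLES\n"
    "0.00500103    2\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "\n"
    "TIMESTEP  PARTICLES\n"
    "0.00500103    0\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
)

for timestep, particles, data in iter_blocks(sample):
    print(timestep, particles, data["ID"])
```

Only one block is held in memory at a time, so with a real file you would write `with open(path) as f:` and do the per-timestep processing inside the `for` loop.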

Here's a sample file:

TIMESTEP  PARTICLES
0.00500103    4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103    5
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

TIMESTEP  PARTICLES
0.00500103    3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

And here is the output:

Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
 (652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
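
The loop above relies on the particle count in each header to know where a block ends. The same idea also works without `genfromtxt`, closer to the `readBlock` function in the question: read exactly that many lines with `itertools.islice` and return `None` at end of file so the caller knows when to stop. This is a rough sketch that uses only the standard library; the name `read_block` and the choice to return plain tuples of floats (rather than a NumPy array) are my assumptions, not part of the original answer.

```python
import io
from itertools import islice


def read_block(f):
    """Read one block; return (timestep, rows), or None at end of file."""
    line = f.readline()
    # Skip optional blank lines before the block header.
    while line and not line.strip():
        line = f.readline()
    if not line:
        return None  # end of file
    if not line.startswith("TIMESTEP"):
        raise ValueError(f"expected a TIMESTEP header, got {line!r}")
    timestep_text, count_text = f.readline().split()
    count = int(count_text)
    f.readline()  # column-name line, not needed here
    # islice pulls exactly `count` lines, so the next call starts
    # at the following block.
    rows = [tuple(float(x) for x in row.split()) for row in islice(f, count)]
    if len(rows) != count:
        raise ValueError("file ended in the middle of a block")
    return float(timestep_text), rows


# Small in-memory stand-in for the real file.
sample = io.StringIO(
    "TIMESTEP  PARTICLES\n"
    "0.00500103    2\n"
    "ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
)

# Keep calling the function until it reports end of file.
while True:
    block = read_block(sample)
    if block is None:
        break
    timestep, rows = block
    print(timestep, len(rows))
```

If a structured array is needed, each `rows` list can be passed to `np.array(rows, dtype=...)` after converting the ID and GROUP columns to `int`.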
