如何在阅读文本文件时纠正列之间的空格? [英] How to correct the spaces between the columns while reading a text file?

查看:0
本文介绍了如何在阅读文本文件时纠正列之间的空格?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从文本文件中读取数据并将其写入hdf5格式。但不知何故,在数据文件的中间,列之间的空格消失了。small part of the file数据如下:

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
.
.
 9994LI     aLI 9994  24.60  41.14   5.32
 9995LI     aLI 9995  88.47  43.02   5.72
 9996LI     aLI 9996  18.98  40.60   5.56
 9997LI     aLI 9997  35.63  46.43   5.68
 9998LI     aLI 9998  33.81  52.15   5.41
 9999LI     aLI 9999  38.72  57.18   5.32
10000LI     aLI10000  29.36  47.12   5.55
10001LI     aLI10001  82.55  44.80   5.50
10002LI     aLI10002  42.52  51.00   5.19
10003LI     aLI10003  28.61  40.21   5.70
10004LI     aLI10004  38.16  42.85   5.33
Generated by trjconv : P/L=1/400 t=   1000.00
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
.
.
 9994LI     aLI 9994  24.60  41.14   5.32
 9995LI     aLI 9995  88.47  43.02   5.72
 9996LI     aLI 9996  18.98  40.60   5.56
 9997LI     aLI 9997  35.63  46.43   5.68
 9998LI     aLI 9998  33.81  52.15   5.41
 9999LI     aLI 9999  38.72  57.18   5.32
10000LI     aLI10000  29.36  47.12   5.55
10001LI     aLI10001  82.55  44.80   5.50
10002LI     aLI10002  42.52  51.00   5.19
10003LI     aLI10003  28.61  40.21   5.70
10004LI     aLI10004  38.16  42.85   5.33
..
..
..
数据是t=1000的帧的集合,具有一百万个帧。如您所见,在框的末尾,第2列和第3列互相接触。我想在读取数据的同时在它们之间创造空间。我遇到的另一个问题是由..生成的重复标题。由于h5文件不支持字符串,我如何将它们读写到hdf5文件中?有没有办法手动添加它们?代码如下:

import h5py
import numpy as np

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

# Next, create an empty .h5 file with the dtype
with h5py.File('xaa.h5', 'w') as hdf:
    ds= hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,)) 

    # Next read line 1 of .gro file
    f = open('xaa', 'r')
    data = f.readlines()
    ds.attrs["Source"]=data[0]
    f.close()

    # loop to read rows from 2 until end
    skip, incr, row0 = 2, 20, 0 
    read_gro = True
    while read_gro:
        arr = np.genfromtxt('xaa', skip_header=skip, max_rows=incr, dtype=gro_dt)
        rows = arr.shape[0]
        if rows == 0:
            read_gro = False 
        else:    
            if row0+rows > ds.shape[0] :
                ds.resize((row0+rows,))
            ds[row0:row0+rows] = arr
            skip += rows
            row0 += rows

我可以跳过第一个标头,但如何处理即将到来的标头?我可以提供行数的标题,如果有人需要。这些列引发值Error

 ValueError: Some errors were detected !
    Line #7 (got 5 columns instead of 6)
    Line #8 (got 5 columns instead of 6)
    Line #9 (got 5 columns instead of 6)

推荐答案

答案更新2021-09-09: 根据注释中的请求,我添加了两个使用f.readline()的新方法。一个使用索引对行进行切片,另一个使用struct包对字段进行解包。struct应该更快,但我没有看到测试文件的性能有显著差异(75个时间步长)。

此外,我还修改了代码,使其在文件末尾使用while True:和Break进行循环。这样就无需输入时间步数。

这是我根据您对上一个问题的回答所遇到的问题而编写的答案。(参考文献:Reading data from gromacs file and write it to the hdf5 file format。此答案使用readlines()将数据读取到列表中。(这可能是您的大文件的一个问题。如果是,则可以使用readline()将解决方案修改为逐行阅读。)它使用与字段宽度对齐的索引对每行上的数据进行切片。警告:阅读50e6行可能需要一段时间。注意:HDF5支持字符串(但h5py不支持NumPy Unicode字符串)。

方法一:使用和进程列表。
通过使用索引对每行进行切片来获取值:

import h5py
import numpy as np

csv_file = 'xaa.txt' # data from link in question

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])
 
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# The values above are used as indices to slice line 
# into the following fields in the loop on data[]: 
# [:7], [7:15], [15:20], [20:27], [27:34], [34:]

# Open the file for reading and
# create an empty .h5 file with the dtype above   
with open(csv_file, 'r') as f, 
     h5py.File('xaa.h5', 'w') as hdf:

    data = f.readlines()
    skip = 0
    step = 0
    while True:
        # Read text header line for THIS time step
        if skip == len(data):
            print("End Of File")
            break
        else:
            header = data[skip]
            print(header)
            skip += 1
        
        # get number of data rows
        no_rows = int(data[skip]) 
        skip += 1
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row, line in enumerate(data[skip:skip+no_rows]):
            arr[row]['col1'] = line[:c1].strip()            
            arr[row]['col2'] = line[c1:c2].strip()            
            arr[row]['col3'] = int(line[c2:c3])
            arr[row]['col4'] = float(line[c3:c4])
            arr[row]['col5'] = float(line[c4:c5])
            arr[row]['col6'] = float(line[c5:])
            
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        # increment by rows plus footer line that follows
        skip += 1 + no_rows

方法二:使用f.readline()逐行阅读
通过使用索引对每行进行切片来获取值:

import h5py
import numpy as np

csv_file = 'xaa.txt'

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

## gro_fmt=[0:7], [7:15], [15:20], [20:27], [27:34], [34:41]
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# Open the file for reading and
# create an empty .h5 file with the dtype above   
with open(csv_file, 'r') as f, 
     h5py.File('xaa.h5', 'w') as hdf:

    step = 0
    while True:
    # Read text header line for THIS time step
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline()) 
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            line = f.readline()
            arr[row]['col1'] = line[:c1].strip()            
            arr[row]['col2'] = line[c1:c2].strip()            
            arr[row]['col3'] = int(line[c2:c3])
            arr[row]['col4'] = float(line[c3:c4])
            arr[row]['col5'] = float(line[c4:c5])
            arr[row]['col6'] = float(line[c5:])
            
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            print(header)
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        footer = f.readline()
        step += 1

方法3:使用f.readlines()逐行阅读
使用structPackage从每行解包:

import struct
import numpy as np
import h5py

csv_file = 'xaa.txt'

fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

with open(csv_file, 'r') as f, 
     h5py.File('xaa.h5', 'w') as hdf:
         
    step = 0
    while True:         
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline())
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            fields = parse( f.readline().encode('utf-8') )
            arr[row]['col1'] = fields[0].strip()            
            arr[row]['col2'] = fields[1].strip()            
            arr[row]['col3'] = int(fields[2])
            arr[row]['col4'] = float(fields[3])
            arr[row]['col5'] = float(fields[4])
            arr[row]['col6'] = float(fields[5])
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        footer = f.readline()
        step += 1

这篇关于如何在阅读文本文件时纠正列之间的空格?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆