Python: Read and write a file with a complex, repeating format

Problem description

To begin with, sorry for my poor English. I have a file with a repeating format, such as:

          326                                         Iteration:       0 #Bonds:       10
        1    6    7   14   54   70   77    0    0    0    0    0    1  0.693  0.632  0.847  0.750  0.644  0.000  0.000  0.000  0.000  0.000  3.566  0.000  0.028
        2    6    3    6   15   55    0    0    0    0    0    0    1  0.925  0.920  0.909  0.892  0.000  0.000  0.000  0.000  0.000  0.000  3.645  0.000 -0.040
        3    6    2    8   10   52    0    0    0    0    0    0    1  0.925  0.910  0.920  0.898  0.000  0.000  0.000  0.000  0.000  0.000  3.653  0.000  0.000
    ...
      324    8  323    0    0    0    0    0    0    0    0    0  100  0.871  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.871  3.000 -0.493
      325    2  326    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  0.000  0.334
      326    8  325    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  3.000 -0.611
       637.916060425841        306.094529423257        1250.10511927236
      6.782126993565285E-006
          326 (repeating from here)                   Iteration:     100 #Bonds:       10
        1    6    7   14   54   64   70   77    0    0    0    0    1  0.885  0.580  0.819  0.335  0.784  0.709  0.000  0.000  0.000  0.000  4.111  0.000  0.025
        2    6    3    6   15   55    0    0    0    0    0    0    1  0.812  0.992  0.869  0.966  0.000  0.000  0.000  0.000  0.000  0.000  3.639  0.000 -0.034
        3    6    2    8   10   52    0    0    0    0    0    0    1  0.812  0.966  0.989  0.926  0.000  0.000  0.000  0.000  0.000  0.000  3.692  0.000  0.004
    

• As you can see, the first line is the header, lines 2 to 327 are the data I want to analyze, and lines 328 and 329 hold some numbers I don't want to use. The next "frame" starts at line 330 with exactly the same format. This "frame" repeats more than 200,000 times.
• I want to use columns 1 to 13 of lines 2 to 327 in each frame. I also want to use the first number of the header.
• For lines 2 to 327 of every frame, I want to analyze columns 3 to 12 and print the number of zeros and the number of non-zero entries in each row. I also want to print columns 1, 2 and 13. So the expected output file looks like:

      326
        1
      1    6    5    5    1
      2    6    4    6    1
      ...
      325  2    1    9  101
      326  8    1    9  101
      326 (Next frame starts from here)
        2
      1    6    5    5    1
      2    6    4    6    1
      ...
      326
        3
      1    6    5    5    1
      2    6    4    6    1
      ...
      

• First line: the first number of the header line.
• Second line: the frame number.
• Lines 3 to 328: column 1 of the input, column 2 of the input, the number of non-zero entries in columns 3 to 12, the number of zeros in columns 3 to 12, and column 13 of the input.
• After that: the same format repeats for the next frame.

So the result file has 2 header lines plus 326 lines of analyzed data, 328 lines per frame in total. The same format repeats for every following frame. I would prefer to keep this layout (fields 5 characters wide) so that the file can be used for other purposes.

My current approach is to create 13 arrays for the 13 columns and fill them with a double for loop over the frames and over the lines of each frame. But I have no idea how to deal with the output.

The following is my trial code (unfinished, it only reads the input), and it has a lot of problems. linecache returns the whole first line, not just the first number of the header. Every frame has 326 + 3 = 329 lines, but my code does not seem to step through the frames correctly. I welcome any help with analyzing this data. Thank you very much in advance.

    # Read the file
    filename = raw_input("Enter the file name \n")
    file = open(filename, 'r')
    
    # Read the number of atom from header
    import linecache
    nnn = linecache.getline(filename, 1)
    natoms = int(nnn)
    singleframe = natoms + 3
    
    # get number of frames
    nlines = 0
    for i1 in file:
        nlines = nlines +1
    file.close()
    
    nframes = nlines / singleframe
    
    print 'no of lines are: ', nlines
    print 'no of frames are: ', nframes
    print 'no of atoms are:', natoms
    
    # Create 1d string array
    nrange = range(nlines)
    data_lines = [None]*(nlines)
    
    # Store whole input file into string array
    file = open(filename, 'r')
    i1=0
    for i1 in nrange:
        data_lines[i1] = file.readline()
    file.close()
    
    
    # Create 1d array to store atomic data
    at_index = [None]*natoms
    at_type = [None]*natoms
    n1 = [None]*natoms
    n2 = [None]*natoms
    n3 = [None]*natoms
    n4 = [None]*natoms
    n5 = [None]*natoms
    n6 = [None]*natoms
    n7 = [None]*natoms
    n8 = [None]*natoms
    n9 = [None]*natoms
    n10 = [None]*natoms
    molnr = [None]*natoms
    
    nrange1= range(natoms)
    nframe = range(nframes)
    
    file = open('output_force','w')
    print data_lines[9]
    for j1 in nframe:
        start = j1*(natoms + 3) + 3
        for i1 in nrange1:
            line = data_lines[i1+start].split()  #Split each line based on spaces
            at_index[i1] = int(line[0])
            at_type[i1] = int(line[1])
            n1[i1]= int(line[2])
            n2[i1]= int(line[3])
            n3[i1]= int(line[4])
            n4[i1]= int(line[5])
            n5[i1]= int(line[6])
            n6[i1]= int(line[7])
            n7[i1]= int(line[8])
            n8[i1]= int(line[9])
            n9[i1]= int(line[10])
            n10[i1]= int(line[11])
            molnr[i1]= int(line[12])
    

Solution

When you are working with delimited text files such as this one, you should look into the csv module. I wrote some code that should do the trick.

This code assumes "good data". If your data set may contain errors (such as rows with fewer than 13 columns, or frames with fewer than 326 data rows), some alterations will be needed.

(Changed to comply with Python 2.6.6.)

    import csv
    with open('mydata.csv') as in_file:
        with open('outfile.csv', 'wb') as out_file:
            # A single space as delimiter plus skipinitialspace=True makes
            # the reader treat runs of spaces as one separator
            csv_reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
            csv_writer = csv.writer(out_file, delimiter = '\t')
    
            # Each pass of this loop consumes one whole frame,
            # so the row it receives is always a frame header
            for i, header in enumerate(csv_reader):
                # Get the header data (first number of the header line)
                num = header[0]
                csv_writer.writerow([num])
    
                # Write frame number, starting with 1 (hence the +1 part)
                csv_writer.writerow([i+1])
    
                # Iterate over all data rows of the frame
                for _ in xrange(326):
    
                    # Call next(csv_reader) to get the next row
                    # Put inside a try ... except to avoid StopIteration exception
                    # if end of file is found before reaching 326 lines
                    try:
                        row = next(csv_reader)
                    except StopIteration:
                        break
                    # Count the zeros in columns 3 to 12 (indices 2..11)
                    zeros = sum([1 for x in row[2:12] if x.strip() == '0'])
                    not_zeros = 10 - zeros
                    # Write the data to output file
                    out = [row[0].strip(), row[1].strip(),not_zeros, zeros, row[12].strip()]
                    csv_writer.writerow(out)
                # The else branch runs only if the loop finished without break,
                # i.e. all 326 data rows of the frame were read
                else:
                    # Skip the two unused lines at the end of the frame
                    next(csv_reader)
                    next(csv_reader)
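
For readers on Python 3, the same streaming idea can be sketched without the csv module. This is not the answerer's code: the function name `summarize_frames` and the `width` parameter are made up for illustration. It assumes that the first number of each header gives the number of data rows in the frame and that two unused lines close every frame, as in the sample above. It uses `str.split()` to handle the runs of spaces and right-aligns every field to 5 characters, which is the layout the question asks for. A frame that is cut short raises `StopIteration`.

```python
import io

TRAILER_LINES = 2  # unused lines that close each frame (taken from the sample)


def summarize_frames(in_file, out_file, width=5):
    """Stream frames from in_file, write per-row zero counts, return frame count."""
    frame_no = 0
    for header in in_file:
        fields = header.split()
        if not fields:              # tolerate stray blank lines between frames
            continue
        natoms = int(fields[0])     # first number of the header line
        frame_no += 1
        out_file.write(f"{natoms:>{width}}\n")
        out_file.write(f"{frame_no:>{width}}\n")
        for _ in range(natoms):
            row = next(in_file).split()
            neighbors = row[2:12]   # columns 3 to 12
            zeros = sum(1 for x in neighbors if int(x) == 0)
            out = (row[0], row[1], len(neighbors) - zeros, zeros, row[12])
            out_file.write("".join(f"{str(v):>{width}}" for v in out) + "\n")
        for _ in range(TRAILER_LINES):
            next(in_file, None)     # skip the unused lines, if present
    return frame_no


if __name__ == "__main__":
    # Tiny two-atom frame, made up for the demo
    demo = io.StringIO(
        "    2    Iteration: 0 #Bonds: 10\n"
        "    1    6    7   14   54   70   77    0    0    0    0    0    1  0.693\n"
        "    2    6    3    6   15   55    0    0    0    0    0    0    1  0.925\n"
        "  637.9  306.0  1250.1\n"
        "  6.78E-006\n"
    )
    result = io.StringIO()
    summarize_frames(demo, result)
    print(result.getvalue(), end="")
```

Because only one line is held in memory at a time, the number of frames does not affect memory use, which matters for a file with more than 200,000 frames. With real files, pass objects returned by `open(...)` instead of the `io.StringIO` buffers.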
    

For the first three data rows of the first frame, this yields:

    326
    1
    1   6   5   5   1
    2   6   4   6   1
    3   6   4   6   1
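
The answer assumes "good data". If rows may be short or contain non-integer values, one possible alteration is to validate each data row before counting. The helper below is a hypothetical Python 3 sketch (the name `parse_row` and its parameters are not from the answer) that reports the offending line number instead of failing with an `IndexError` later on.

```python
def parse_row(line, lineno, n_cols=13):
    """Return the first n_cols fields of a data row as ints, or raise ValueError."""
    fields = line.split()
    if len(fields) < n_cols:
        raise ValueError(
            f"line {lineno}: expected at least {n_cols} columns, got {len(fields)}"
        )
    try:
        return [int(x) for x in fields[:n_cols]]
    except ValueError as exc:
        # int() failed on one of the fields; add the line number for context
        raise ValueError(
            f"line {lineno}: non-integer value in the first {n_cols} columns"
        ) from exc
```

A caller can catch `ValueError` to skip or log a damaged frame rather than aborting the whole run.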
    
