用Python整形数据 [英] Shaping Data in Python

查看:94
本文介绍了用Python整形数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在使用Eyelink生成的数据. csv(从asc转换而来)基本上是一个较大的顺序列表,即未创建列,因此例如一行将具有"start_trial 1",而下一行将具有x和y坐标,而随后的N行也将在到达之前到"PreBeep1_1st_Sketchpad"行,最后到"start_trial 2"行.

I'm currently working with data generated by eyelink. The csv (transformed from asc) is basically one large sequential list, i.e. columns are not created, so for example a row will have 'start_trial 1' and the following row will have x and y coordinates and the following N rows will also before coming to 'PreBeep1_1st_Sketchpad' row and eventually 'start_trial 2' row.

我想知道是否有人对如何操纵这种堆叠"数据并将其转换为长格式数据有任何建议?

I was wondering if anyone has any advice on how to manipulate this 'stacked' data and transform it into long form data?

这是从csv中提取数据时的样子:

Here is what the data looks like when pulled from the csv:

MSG 12892743 start_trial    1   SCNB    
12892743      757.0   361.7  5916.0 ... SCNB    
MSG 12892744 PreBeep1_1st_Sketchpad SCNB
12892744      756.7   361.7  5920.0 ... SCNB    
12892745      756.1   362.2  5924.0 ... SCNB    
MSG 12892746 order of frames:   SCNB    
12892746      755.8   362.3  5928.0 ... SCNB    
12892747      756.7   362.3  5927.0 ... SCNB    
MSG 12892748 crosshair  SCNB    
12892748      757.8   361.8  5928.0 ... SCNB    
12892749      758.4   361.8  5930.0 ... SCNB    
MSG 12892750 sketchpad  SCNB    
12892750      758.1   361.7  5934.0 ... SCNB    
12892751      758.3   361.7  5938.0 ... SCNB    
MSG 12892752 sketchpad  SCNB    
12892752      759.1   361.9  5948.0 ... SCNB    
12892753      760.4   362.7  5956.0 ... SCNB    
MSG 12892754 sketchpad  SCNB    
12892754      761.7   363.5  5964.0 ... SCNB    
12892755      763.9   364.0  5966.0 ... SCNB    
MSG 12892756 buffer1    SCNB    
12892756      765.6   364.1  5970.0 ... SCNB    
12892757      766.2   364.3  5972.0 ... SCNB    
MSG 12892758 Diode1 SCNB    
12892758      765.2   364.3  5973.0 ... SCNB    
12892759      764.1   364.5  5964.0 ... SCNB    
12892760      763.9   364.7  5955.0 ... SCNB

理想情况下,我想为以下内容提供单独的列:

Ideally I'd like to have individual columns for:

Trial ID (SCNB shown above)
Frame ID (PreBeep1_1st_Sketchpad above)
X-CoOr (757.0 above)
Y-CoOr (361.7 above)
Time (5916.0 above)

如果有帮助,分隔符在csv文件中是\ t.

Delimiters are \t in the csv file if that helps.

可以看出,数据是从上到下按顺序逐行写入的,而不是按照我想要的形状组织成列.

As can be seen the data is written row-after-row sequentially from top-to-bottom instead of being organised into columns as I want to shape them.

"..."也是实际值.

the '...' are actual values also.

关于将包含诸如"start_trial"和"PreBeep1_1st_Sketchpad"之类的框架ID的列,理想情况下,我希望在该列中重复该框架的名称,直到遇到新的框架为止.

Regarding the column that will contain Frame IDs such as 'start_trial' and 'PreBeep1_1st_Sketchpad' I would ideally want the name of that frame repeated in the column until encountering a new one.

任何帮助或建议,将不胜感激.

Any help or advice would be greatly appreciated.

输出应如下所示:

Trial ID       Frame ID                 X-CoOr    Y-CoOr    Time 
  SCNB           Start_Trial              757.0    361.7    5916.0 
  SCNB           PreBeep1_1st_Sketchpad   756.7    361.7    5920.0
  SCNB           PreBeep1_1st_Sketchpad   756.1    362.2    5924.0

感谢您抽出宝贵的时间阅读.

Thanks for taking the time to read.

这是我正在使用的代码:

Here is the code I was working with:

file2 = open('P1E2E_Both_New_trial_data.csv', 'rb')
Long_Format = open('P1E2E_Long_Format.csv', 'w')
writer1 = csv.writer(Long_Format, delimiter = '\t')

#First create column headings
columns = ["Trial ID"] + ['Frame ID'] + ['X-CoOr'] + ['Y-CoOr'] + ['Time']
writer1.writerow(columns)

reader1 = csv.reader(file2, delimiter = '\t')

for row in reader1:
    # if statement here to skip blank lines
    if len(row) > 1:
        if 'start_trial' in row[1]:
            label = [row[3]] + ['start_trial']
            writer1.writerow(label)



file2.close()   # <---IMPORTANT
Long_Format.close()

上面的输出是:

Trial ID      Frame ID      X-CoOr     Y-CoOr     Time

SCNB          start_trial

RCL           start_trial

SCR           start_trial

...等等.

我的问题在于我不知道从这里到哪里.即使可行,我的方法也效率极低.我不知道如何告诉python在if语句中继续读取标签'Start_Trial'之后的行,以及如何在所述标签后的适当列中写入row [2]和row [3]的x和y CoOr值. .这有道理吗?

My problem lies in that I don't know where to go from here. My approach would be terribly inefficient even it were to work. I don't know how to tell python to continue reading the lines after the label 'Start_Trial' in the if statement and to write the x and y CoOr values from row[2] and row[3] in the appropriate columns after said label. Does that makes sense?

推荐答案

如果我们假设所有行都具有相同的Delemeter,则此问题并不像看起来那样严重.

If we assume that all lines have the same delemeter, this problem isn't as bad as it looks.

关键是要意识到所有框架线都以关键'MSG'开头:

The key is realizing that all of the frame lines start with the key 'MSG':

import csv
# Header values
FRAME_KEY = 'MSG'
FRAME_IDX = 0
TRIAL_ID_KEY = 'Trial ID'
TRIAL_ID_IDX = 3
FRAME_ID_KEY = 'Frame ID'
FRAME_ID_IDX = 2
# Data values
XCOR_KEY     = 'X-CoOr'
XCOR_IDX     = 1
YCOR_KEY     = 'Y-CoOr'
YCOR_IDX     = 2
TIME_KEY     = 'Time'
TIME_IDX     = 3

IN_DELIM = '\t'
OUT_DELIM= '\t'

OUT_HEADER = [TRIAL_ID_KEY, FRAME_ID_KEY, XCOR_KEY, YCOR_KEY, TIME_KEY]

with open('P1E2E_Both_New_trial_data.csv', 'rb') as in_file, open('P1E2E_Long_Format.csv') as out_file:
    in_reader = csv.reader(in_file, delimeter = IN_DELIM)
    out_writer= csv.DictWriter(out_file, OUT_HEADER, delimeter = OUT_DELIM)
    out_writer.writeheader()
    current_frame = None
    current_trial = None
    for row in in_reader:
        if row[FRAME_IDX] == FRAME_KEY:
            # Means we're at the start of a new frame
            current_frame = row[FRAME_ID_IDX]
            current_trial = row[TRIAL_ID_IDX]
        else:
            # Means we're in a data row
            out_row = dict()
            out_row[FRAME_ID_KEY] = current_frame
            out_row[TRIAL_ID_KEY] = current_trial
            out_row[XCOR_KEY]     = row[XCOR_IDX]
            out_row[YCOR_KEY]     = row[YCOR_IDX]
            out_row[TIME_KEY]     = row[TIME_IDX]
            out_writer.writerow(out_row)

基本上,当您按下'MSG'键连续一行时,您就知道要开始一个新的帧.否则,将数据写出. DictWriter使自动执行此操作变得容易,而不必担心顺序(该顺序由OUT_HEADER定义)

Basically, when you hit a row with the 'MSG' key, you know you're starting a new frame. Otherwise you write out the data. DictWriter makes it easy to do this automatically without having to worry about order (the order is defined by the OUT_HEADER)

这篇关于用Python整形数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆