用Python整形数据 [英] Shaping Data in Python
问题描述
我目前正在使用Eyelink生成的数据. csv(从asc转换而来)基本上是一个较大的顺序列表,即未创建列,因此例如一行将具有"start_trial 1",而下一行将具有x和y坐标,而随后的N行也将在到达之前到"PreBeep1_1st_Sketchpad"行,最后到"start_trial 2"行.
I'm currently working with data generated by eyelink. The csv (transformed from asc) is basically one large sequential list, i.e. columns are not created, so for example a row will have 'start_trial 1' and the following row will have x and y coordinates and the following N rows will also before coming to 'PreBeep1_1st_Sketchpad' row and eventually 'start_trial 2' row.
我想知道是否有人对如何操纵这种堆叠"数据并将其转换为长格式数据有任何建议?
I was wondering if anyone has any advice on how to manipulate this 'stacked' data and transform it into long form data?
这是从csv中提取数据时的样子:
Here is what the data looks like when pulled from the csv:
MSG 12892743 start_trial 1 SCNB
12892743 757.0 361.7 5916.0 ... SCNB
MSG 12892744 PreBeep1_1st_Sketchpad SCNB
12892744 756.7 361.7 5920.0 ... SCNB
12892745 756.1 362.2 5924.0 ... SCNB
MSG 12892746 order of frames: SCNB
12892746 755.8 362.3 5928.0 ... SCNB
12892747 756.7 362.3 5927.0 ... SCNB
MSG 12892748 crosshair SCNB
12892748 757.8 361.8 5928.0 ... SCNB
12892749 758.4 361.8 5930.0 ... SCNB
MSG 12892750 sketchpad SCNB
12892750 758.1 361.7 5934.0 ... SCNB
12892751 758.3 361.7 5938.0 ... SCNB
MSG 12892752 sketchpad SCNB
12892752 759.1 361.9 5948.0 ... SCNB
12892753 760.4 362.7 5956.0 ... SCNB
MSG 12892754 sketchpad SCNB
12892754 761.7 363.5 5964.0 ... SCNB
12892755 763.9 364.0 5966.0 ... SCNB
MSG 12892756 buffer1 SCNB
12892756 765.6 364.1 5970.0 ... SCNB
12892757 766.2 364.3 5972.0 ... SCNB
MSG 12892758 Diode1 SCNB
12892758 765.2 364.3 5973.0 ... SCNB
12892759 764.1 364.5 5964.0 ... SCNB
12892760 763.9 364.7 5955.0 ... SCNB
理想情况下,我想为以下内容提供单独的列:
Ideally I'd like to have individual columns for:
Trial ID (SCNB shown above)
Frame ID (PreBeep1_1st_Sketchpad above)
X-CoOr (757.0 above)
Y-CoOr (361.7 above)
Time (5916.0 above)
如果有帮助,分隔符在csv文件中是\ t.
Delimiters are \t in the csv file if that helps.
可以看出,数据是从上到下按顺序逐行写入的,而不是按照我想要的形状组织成列.
As can be seen the data is written row-after-row sequentially from top-to-bottom instead of being organised into columns as I want to shape them.
"..."也是实际值.
the '...' are actual values also.
关于将包含诸如"start_trial"和"PreBeep1_1st_Sketchpad"之类的框架ID的列,理想情况下,我希望在该列中重复该框架的名称,直到遇到新的框架为止.
Regarding the column that will contain Frame IDs such as 'start_trial' and 'PreBeep1_1st_Sketchpad' I would ideally want the name of that frame repeated in the column until encountering a new one.
任何帮助或建议,将不胜感激.
Any help or advice would be greatly appreciated.
输出应如下所示:
Trial ID Frame ID X-CoOr Y-CoOr Time
SCNB Start_Trial 757.0 361.7 5916.0
SCNB PreBeep1_1st_Sketchpad 756.7 361.7 5920.0
SCNB PreBeep1_1st_Sketchpad 756.1 362.2 5924.0
感谢您抽出宝贵的时间阅读.
Thanks for taking the time to read.
这是我正在使用的代码:
Here is the code I was working with:
file2 = open('P1E2E_Both_New_trial_data.csv', 'rb')
Long_Format = open('P1E2E_Long_Format.csv', 'w')
writer1 = csv.writer(Long_Format, delimiter = '\t')
#First create column headings
columns = ["Trial ID"] + ['Frame ID'] + ['X-CoOr'] + ['Y-CoOr'] + ['Time']
writer1.writerow(columns)
reader1 = csv.reader(file2, delimiter = '\t')
for row in reader1:
# if statement here to skip blank lines
if len(row) > 1:
if 'start_trial' in row[1]:
label = [row[3]] + ['start_trial']
writer1.writerow(label)
file2.close() # <---IMPORTANT
Long_Format.close()
上面的输出是:
Trial ID Frame ID X-CoOr Y-CoOr Time
SCNB start_trial
RCL start_trial
SCR start_trial
...等等.
我的问题在于我不知道从这里到哪里.即使可行,我的方法也效率极低.我不知道如何告诉python在if语句中继续读取标签'Start_Trial'之后的行,以及如何在所述标签后的适当列中写入row [2]和row [3]的x和y CoOr值. .这有道理吗?
My problem lies in that I don't know where to go from here. My approach would be terribly inefficient even it were to work. I don't know how to tell python to continue reading the lines after the label 'Start_Trial' in the if statement and to write the x and y CoOr values from row[2] and row[3] in the appropriate columns after said label. Does that makes sense?
推荐答案
如果我们假设所有行都具有相同的Delemeter,则此问题并不像看起来那样严重.
If we assume that all lines have the same delemeter, this problem isn't as bad as it looks.
关键是要意识到所有框架线都以关键'MSG'
开头:
The key is realizing that all of the frame lines start with the key 'MSG'
:
import csv
# Header values
FRAME_KEY = 'MSG'
FRAME_IDX = 0
TRIAL_ID_KEY = 'Trial ID'
TRIAL_ID_IDX = 3
FRAME_ID_KEY = 'Frame ID'
FRAME_ID_IDX = 2
# Data values
XCOR_KEY = 'X-CoOr'
XCOR_IDX = 1
YCOR_KEY = 'Y-CoOr'
YCOR_IDX = 2
TIME_KEY = 'Time'
TIME_IDX = 3
IN_DELIM = '\t'
OUT_DELIM= '\t'
OUT_HEADER = [TRIAL_ID_KEY, FRAME_ID_KEY, XCOR_KEY, YCOR_KEY, TIME_KEY]
with open('P1E2E_Both_New_trial_data.csv', 'rb') as in_file, open('P1E2E_Long_Format.csv') as out_file:
in_reader = csv.reader(in_file, delimeter = IN_DELIM)
out_writer= csv.DictWriter(out_file, OUT_HEADER, delimeter = OUT_DELIM)
out_writer.writeheader()
current_frame = None
current_trial = None
for row in in_reader:
if row[FRAME_IDX] == FRAME_KEY:
# Means we're at the start of a new frame
current_frame = row[FRAME_ID_IDX]
current_trial = row[TRIAL_ID_IDX]
else:
# Means we're in a data row
out_row = dict()
out_row[FRAME_ID_KEY] = current_frame
out_row[TRIAL_ID_KEY] = current_trial
out_row[XCOR_KEY] = row[XCOR_IDX]
out_row[YCOR_KEY] = row[YCOR_IDX]
out_row[TIME_KEY] = row[TIME_IDX]
out_writer.writerow(out_row)
基本上,当您按下'MSG'
键连续一行时,您就知道要开始一个新的帧.否则,将数据写出. DictWriter
使自动执行此操作变得容易,而不必担心顺序(该顺序由OUT_HEADER
定义)
Basically, when you hit a row with the 'MSG'
key, you know you're starting a new frame. Otherwise you write out the data. DictWriter
makes it easy to do this automatically without having to worry about order (the order is defined by the OUT_HEADER
)
这篇关于用Python整形数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!