Splitting a csv file into a pandas dataframe by multiple columns
Problem description
I have a tsv file with multiple columns. There are more than 10 columns, but the ones important to me are user_name, shift_id, and url_id. I want to create a data frame that first separates the entire file by user_name, i.e. only rows with the same user_name are grouped together. From that chunk I want to make another chunk where only rows with a certain shift_id are grouped together, and then from that chunk make a chunk with the same URL. Unfortunately I cannot share the data because of company rules, and making an imaginary data table might be more confusing.
Two of the other columns contain timestamps. I want to calculate the time duration of each chunk, but only after grouping the rows by those columns.
I have seen answers that split a data frame by a specific column value, but in my case I have three column values, and the order in which they are separated matters too.
Thank you for your help!
Recommended answer
Assuming you have already read the columns into a dataframe:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3], 'col2':[4,5,6],'col3':[7,8,9],
'col4':[1,2,3],'col5':[1,2,3],'col6':[1,2,3],
'col7':[1,2,3],'col8':[1,2,3],'col9':[1,2,3],
'col91':[1,2,3]})
print(df)
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col91
0 1 4 7 1 1 1 1 1 1 1
1 2 5 8 2 2 2 2 2 2 2
2 3 6 9 3 3 3 3 3 3 3
Now we can select only the three columns of interest, say col1, col2, and col3:
tmp_df = df[['col1', 'col2', 'col3']]
print(tmp_df)
Output:
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
Next, we filter the rows based on the values in those three columns:
final_df = tmp_df[(tmp_df.col1 == 1) & (tmp_df.col2 == 4) & (tmp_df.col3== 7)]
print(final_df)
Output:
col1 col2 col3
0 1 4 7
After reading the file into a dataframe, all of the above steps can be achieved in a single line:
final = df[['col1', 'col2', 'col3']][(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final
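The same selection can also be written as a single `.loc` call, which takes the row mask and the column list together and avoids chaining two indexing operations. This is only a minimal sketch on a cut-down version of the toy frame above; the variable name `mask` is introduced here for illustration.

```python
import pandas as pd

# Cut-down version of the toy frame used above (fewer filler columns).
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6],
                   'col3': [7, 8, 9], 'col4': [1, 2, 3]})

# Boolean mask: True only where all three conditions hold.
mask = (df['col1'] == 1) & (df['col2'] == 4) & (df['col3'] == 7)

# Select matching rows and the three columns of interest in one step.
final = df.loc[mask, ['col1', 'col2', 'col3']]
print(final)
```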
Hope this helps!
df = pd.DataFrame({'col1':[1,1,1,1,1], 'col2':[4,4,4,4,7],'col3':[7,7,9,7,7],
'col4':['X','X','X','X','X'],'col5':['X','X','X','X','X'],'col6':['X','X','X','X','X'],
'col7':['X','X','X','X','X'],'col8':['X','X','X','X','X'],'col9':['X','X','X','X','X'],
'col91':['X','X','X','X','X']})
print(df)
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col91
0 1 4 7 X X X X X X X
1 1 4 7 X X X X X X X
2 1 4 9 X X X X X X X
3 1 4 7 X X X X X X X
4 1 7 7 X X X X X X X
Now, using similar masking as above:
final = df[(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col91
0 1 4 7 X X X X X X X
1 1 4 7 X X X X X X X
3 1 4 7 X X X X X X X
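Masking picks out one combination of values at a time. To process every (user_name, shift_id, url_id) combination in that order, as the question describes, `groupby` with a list of columns may be a better fit. The following is a rough sketch under stated assumptions: the data is invented, and the timestamp column names `start_time` and `end_time` are hypothetical, since the question does not name them.

```python
import pandas as pd

# Invented sample data; start_time / end_time are hypothetical column names.
df = pd.DataFrame({
    'user_name': ['alice', 'alice', 'alice', 'bob'],
    'shift_id': [1, 1, 2, 1],
    'url_id': [10, 10, 10, 20],
    'start_time': pd.to_datetime(['2020-01-01 09:00', '2020-01-01 09:05',
                                  '2020-01-01 13:00', '2020-01-01 09:00']),
    'end_time': pd.to_datetime(['2020-01-01 09:04', '2020-01-01 09:30',
                                '2020-01-01 13:10', '2020-01-01 09:15']),
})

# Group by user first, then shift, then URL (the list order sets the hierarchy).
keys = ['user_name', 'shift_id', 'url_id']

# Earliest start and latest end per group, then the elapsed time between them.
spans = df.groupby(keys).agg(first_start=('start_time', 'min'),
                             last_end=('end_time', 'max'))
spans['duration'] = spans['last_end'] - spans['first_start']
print(spans)

# Each chunk is also available as its own dataframe if needed.
for key, chunk in df.groupby(keys):
    print(key, len(chunk))
```

Whether "duration" should mean latest end minus earliest start, or something else such as a sum of per-row durations, depends on the data; the aggregation step would change accordingly.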