Splitting a csv file into pandas dataframes by multiple columns

Question

I have a tsv file with more than 10 columns, but the ones that matter to me are user_name, shift_id, and url_id. I want to create a data frame that first splits the entire file by user_name, i.e. only rows with the same user_name are grouped together. From each such chunk I want to make another chunk in which only rows with a certain shift_id are grouped together, and then from that chunk make chunks of rows with the same URL. Unfortunately I cannot share the data because of company rules, and making up an imaginary data table might be more confusing.

Two of the other columns hold timestamps. I want to calculate the time duration of each chunk, but only after the chunks have been grouped according to those columns.

I have seen answers that split a data frame by a specific column value, but in my case I have three column values, and the order in which they are separated matters too.

Thanks for your help!

Recommended answer

Assuming you have already read the columns into a dataframe:

import pandas as pd

df = pd.DataFrame({'col1':[1,2,3], 'col2':[4,5,6],'col3':[7,8,9],
               'col4':[1,2,3],'col5':[1,2,3],'col6':[1,2,3],
               'col7':[1,2,3],'col8':[1,2,3],'col9':[1,2,3],
               'col91':[1,2,3]})
print(df)

Output:

     col1  col2  col3  col4  col5  col6  col7  col8  col9  col91
0     1     4     7     1     1     1     1     1     1      1
1     2     5     8     2     2     2     2     2     2      2
2     3     6     9     3     3     3     3     3     3      3
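The dataframe above is built by hand. For an actual tab-separated file, the reading step might look like the sketch below. The column names user_name, shift_id, and url_id come from the question; the timestamp column names start_time and end_time and all of the values are assumptions made up for illustration, and an in-memory string stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for the real file.
# start_time / end_time are assumed names for the two timestamp columns.
tsv_text = (
    "user_name\tshift_id\turl_id\tstart_time\tend_time\n"
    "alice\t1\t10\t2021-01-01 09:00:00\t2021-01-01 09:05:00\n"
    "bob\t2\t11\t2021-01-01 10:00:00\t2021-01-01 10:30:00\n"
)

df_tsv = pd.read_csv(
    io.StringIO(tsv_text),                   # a file path can be passed instead
    sep="\t",                                # tab-separated rather than comma-separated
    parse_dates=["start_time", "end_time"],  # parse the timestamp columns
)
print(df_tsv.dtypes)
```

Parsing the timestamps at read time means they can be subtracted later without a separate conversion step.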

Now we can select only the three columns of interest; let them be col1, col2, and col3:

tmp_df = df[['col1', 'col2', 'col3']]
print(tmp_df)

Output:

     col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9

Next, we filter on the values of these three columns:

final_df = tmp_df[(tmp_df.col1 == 1) & (tmp_df.col2 == 4) & (tmp_df.col3 == 7)]
print(final_df)

Output:

    col1  col2  col3
0     1     4     7

After reading into a dataframe, all of the steps above can be achieved in a single line:

final = df.loc[(df.col1 == 1) & (df.col2 == 4) & (df.col3 == 7), ['col1', 'col2', 'col3']]
final
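As an aside, the same three-column filter can also be written with `DataFrame.query`, which some find easier to read when several conditions are combined. This is only an alternative spelling of the boolean mask, shown on a small made-up dataframe; the name `demo` is hypothetical.

```python
import pandas as pd

# Small made-up dataframe with the same col1..col3 layout as the example.
demo = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})

# Boolean mask, as in the answer.
by_mask = demo[(demo.col1 == 1) & (demo.col2 == 4) & (demo.col3 == 7)]

# The same filter expressed as a query string.
by_query = demo.query("col1 == 1 and col2 == 4 and col3 == 7")

print(by_query)
```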

Hope it helps!

Here is another example, this time with repeated values and with all columns kept:

df = pd.DataFrame({'col1':[1,1,1,1,1], 'col2':[4,4,4,4,7],'col3':[7,7,9,7,7],
               'col4':['X','X','X','X','X'],'col5':['X','X','X','X','X'],'col6':['X','X','X','X','X'],
               'col7':['X','X','X','X','X'],'col8':['X','X','X','X','X'],'col9':['X','X','X','X','X'],
               'col91':['X','X','X','X','X']})
print(df)

Output:

     col1  col2  col3 col4 col5 col6 col7 col8 col9 col91
0     1     4     7    X    X    X    X    X    X     X
1     1     4     7    X    X    X    X    X    X     X
2     1     4     9    X    X    X    X    X    X     X
3     1     4     7    X    X    X    X    X    X     X
4     1     7     7    X    X    X    X    X    X     X

Now, using similar masking as above:

final = df[(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final

Output:

    col1  col2  col3 col4 col5 col6 col7 col8 col9 col91
0     1     4     7    X    X    X    X    X    X     X
1     1     4     7    X    X    X    X    X    X     X
3     1     4     7    X    X    X    X    X    X     X
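The masks above pick out one combination of values at a time. If every combination of user_name, shift_id, and url_id is needed, together with a duration per chunk as the question describes, `groupby` on the three columns in that order may be a better fit. The following is a sketch under stated assumptions: the data is made up, start_time and end_time are assumed names for the two timestamp columns, and "duration" is taken to mean the span from the earliest start to the latest end within a chunk.

```python
import pandas as pd

# Hypothetical data: the key column names come from the question;
# start_time / end_time and every value are assumptions for illustration.
log_df = pd.DataFrame({
    "user_name": ["alice", "alice", "alice", "bob"],
    "shift_id": [1, 1, 2, 1],
    "url_id": [10, 10, 10, 20],
    "start_time": pd.to_datetime([
        "2021-01-01 09:00", "2021-01-01 09:10",
        "2021-01-01 13:00", "2021-01-01 09:00"]),
    "end_time": pd.to_datetime([
        "2021-01-01 09:05", "2021-01-01 09:30",
        "2021-01-01 13:20", "2021-01-01 09:45"]),
})

# Group by user first, then shift, then URL.
keys = ["user_name", "shift_id", "url_id"]

# One row per chunk: earliest start and latest end within the chunk.
summary = (
    log_df.groupby(keys, sort=False)
          .agg(first_start=("start_time", "min"), last_end=("end_time", "max"))
          .reset_index()
)
summary["duration"] = summary["last_end"] - summary["first_start"]
print(summary)

# Each chunk is also available as its own dataframe.
for key, chunk in log_df.groupby(keys, sort=False):
    print(key, len(chunk))
```

Passing `sort=False` keeps the chunks in order of first appearance in the file rather than sorting them by key.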
