Splitting dataframe into multiple dataframes

Question

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).

I would like to split the dataframe into 60 dataframes (a dataframe for each participant).

In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.

I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):

import pandas as pd

def splitframe(data, name='name'):
    
    n = data[name][0]

    df = pd.DataFrame(columns=data.columns)

    datalist = []

    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])
        
    return datalist

I do not get an error message; the script just seems to run forever!

Is there a smart way to do it?

Recommended answer

Firstly, your approach is inefficient because appending on a row-by-row basis is slow: a list has to grow periodically when there is insufficient space for a new entry, and each df.append call additionally returns a new dataframe containing everything accumulated so far. List comprehensions are better in this respect, as the result is built in a single expression rather than through repeated appends.
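As a rough illustration of the list-comprehension idea, the sketch below builds one sub-dataframe per participant with a boolean mask instead of appending row by row. The sample data and the 'score' column are made up for the example; only the 'name' column comes from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real experiment results.
data = pd.DataFrame({
    "name": ["anna", "anna", "bob", "carl", "bob"],
    "score": [1, 2, 3, 4, 5],
})

# unique() returns the participant codes in order of first appearance.
names = data["name"].unique()

# One boolean-mask selection per participant, collected in a list.
datalist = [data[data["name"] == n] for n in names]

print(len(datalist))
print(datalist[1])
```

This makes one vectorized pass over the data per participant (60 passes in the question's case) rather than one Python-level append per row.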

However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
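To make the point about not needing separate dataframes concrete: many per-participant computations can run directly on the single dataframe. The sketch below uses pandas groupby, which is a different technique from the one this answer goes on to describe; the sample data and the 'score' column are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the 'name' column comes from the question.
data = pd.DataFrame({
    "name": ["anna", "anna", "bob", "carl", "bob"],
    "score": [1, 2, 3, 4, 5],
})

# Per-participant statistic computed without building any sub-dataframes.
means = data.groupby("name")["score"].mean()
print(means)

# If each participant's rows are needed, iterating over the groups
# yields (name, sub-dataframe) pairs one at a time.
sizes = {name: len(group) for name, group in data.groupby("name")}
print(sizes)
```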

I would sort the dataframe by the column 'name', set that column as the index and, if you still need it as a regular column, not drop it.

Then generate a list of all the unique entries, which you can use to perform lookups. Crucially, if you are only querying the data, use the selection criteria to pull out the rows for one participant at the point where you need them, rather than paying up front for a separate copy of the data for every participant.

Using pandas.DataFrame.sort_values and pandas.DataFrame.set_index:

# sort the dataframe
df.sort_values(by='name', axis=0, inplace=True)

# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)

# get a list of names
names=df['name'].unique().tolist()

# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']

# now you can query all 'joes'
