Fastest way to sample a Pandas DataFrame?


Question

First, I want to take random samples from three dataframes (150 rows each) and concatenate the results. Second, I want to repeat this process as many times as possible.

For part 1, I use the following function:

def get_sample(n_A, n_B, n_C):
    A = df_A.sample(n = n_A, replace=False)
    B = df_B.sample(n = n_B, replace=False)
    C = df_C.sample(n = n_C, replace=False)
    return pd.concat([A, B, C])

For part 2, I use the following line:

results = [get_sample(5,5,3) for i in range(n)] 

Currently, with n = 50,000, the analysis takes about 1 minute and 40 seconds on my MacBook. Any advice on how to speed up this process is welcome!

PS: The three dataframes (df_A, df_B, df_C) differ only in one categorical feature. The challenge is that I want a specific number of samples from each category.

Recommended answer

In your case, it should pay off to work with numpy arrays instead of pandas dataframes (as already noted by Leevo).

Numpy arrays are simpler objects than pandas dataframes (the absence of row/column labels in numpy arrays is a prime example). As a result, operations such as concatenation run faster on numpy arrays. The time difference is usually negligible when you perform just a few concatenations within a larger script. In your case, however, the concatenation happens inside a loop with many iterations, so the per-call overhead accumulates and becomes significant.
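To get a rough feel for that per-call overhead, here is a minimal timing sketch. The `frames` and `arrays` lists are hypothetical stand-ins for the three sampled pieces (5, 5 and 3 rows), not data from the question, and the absolute timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the sampled pieces: 5, 5 and 3 rows of 10 columns
frames = [pd.DataFrame(np.random.rand(k, 10)) for k in (5, 5, 3)]
arrays = [f.to_numpy() for f in frames]

# Time 1,000 concatenations of each kind
t_pandas = timeit.timeit(lambda: pd.concat(frames), number=1000)
t_numpy = timeit.timeit(lambda: np.concatenate(arrays), number=1000)

print(f"pd.concat:      {t_pandas:.4f} s for 1,000 calls")
print(f"np.concatenate: {t_numpy:.4f} s for 1,000 calls")
```

Multiplying the per-call gap by the number of loop iterations gives an estimate of how much of the total runtime is concatenation overhead.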

Try the following:

import pandas as pd
import numpy as np

# Initialize example dataframes
df_A = pd.DataFrame(np.random.rand(150, 10))
df_B = pd.DataFrame(np.random.rand(150, 10))
df_C = pd.DataFrame(np.random.rand(150, 10))

# Initialize constants
n_A = 5
n_B = 5
n_C = 3
n = 10000

# Reduce dataframes to numpy arrays
arr_A = df_A.values
arr_B = df_B.values
arr_C = df_C.values

# Perform sampling on numpy arrays
def get_sample():
    A = arr_A[np.random.choice(arr_A.shape[0], n_A, replace=False)]
    B = arr_B[np.random.choice(arr_B.shape[0], n_B, replace=False)]
    C = arr_C[np.random.choice(arr_C.shape[0], n_C, replace=False)]
    return np.concatenate([A, B, C])
results = [get_sample() for i in range(n)]
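If the Python-level loop itself becomes the bottleneck, one possible further step is to draw the indices for all n repetitions in a single vectorized call. The sketch below is not part of the original answer: the helper `sample_many` is a hypothetical name, and it assumes the arrays are small enough that an (n, 150) matrix of random keys fits comfortably in memory. It relies on the fact that argsorting independent random keys yields a uniformly random permutation per row, so the first k positions are k distinct row indices.

```python
import numpy as np

rng = np.random.default_rng()

def sample_many(arr, k, n, rng):
    """Draw n independent samples of k distinct rows each from arr.

    Returns an array of shape (n, k, arr.shape[1]).
    """
    # One row of random keys per repetition; argsort turns each row into a
    # random permutation of the row indices, and we keep the first k of them.
    idx = np.argsort(rng.random((n, arr.shape[0])), axis=1)[:, :k]
    return arr[idx]

# Example arrays with the same shape as in the answer above
arr_A = np.random.rand(150, 10)
arr_B = np.random.rand(150, 10)
arr_C = np.random.rand(150, 10)
n = 10000

# Stack the A, B and C samples along the row axis of each repetition
results = np.concatenate(
    [
        sample_many(arr_A, 5, n, rng),
        sample_many(arr_B, 5, n, rng),
        sample_many(arr_C, 3, n, rng),
    ],
    axis=1,
)
print(results.shape)  # (n, 13, 10)
```

Here `results[i]` plays the role of one `get_sample()` call. The result is a single 3-D array rather than a list of 2-D arrays, which may or may not suit the downstream analysis.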
