以相同方式对两个 pandas 数据帧进行采样 [英] Sample two pandas dataframes the same way

查看:66
本文介绍了以相同方式对两个 pandas 数据帧进行采样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在做一个机器学习计算,它有两个数据帧-一个用于因子,另一个用于目标值.我必须将其分为训练和测试两个部分.在我看来,我已经找到了方法,但是我正在寻找更优雅的解决方案.这是我的代码:

I'm doing a machine learning computations having two dataframes - one for factors and other one for target values. I have to split both into training and testing parts. It seems to me that I've found the way but I'm looking for more elegant solution. Here is my code:

import pandas as pd
import numpy as np
import random

df_source = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('CD'))

rows = np.asarray(random.sample(range(0, len(df_source)), 2))

df_source_train = df_source.iloc[rows]
df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
df_target_train = df_target.iloc[rows]
df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

print('rows')
print(rows)
print('source')
print(df_source)
print('source train')
print(df_source_train)
print('source_test')
print(df_source_test)

----已编辑-unutbu解决方案(已修正)---

---- edited - solution by unutbu (midified) ---

np.random.seed(2013)
percentile = .6
rows = np.random.binomial(1, percentile, size=len(df_source)).astype(bool)

df_source_train = df_source[rows]
df_source_test = df_source[~rows]
df_target_train = df_target[rows]
df_target_test = df_target[~rows]

推荐答案

如果将rows设置为长度为len(df)的布尔数组,则可以使用df[rows]获取True行并获取带有df[~rows]的行:

If you make rows a boolean array of length len(df), then you can get the True rows with df[rows] and get the False rows with df[~rows]:

import pandas as pd
import numpy as np
import random
np.random.seed(2013)

df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))

rows = np.random.randint(2, size=len(df_source)).astype('bool')

df_source_train = df_source[rows]
df_source_test = df_source[~rows]

print(rows)
# [ True  True False  True False]

# if for some reason you need the index values of where `rows` is True
print(np.where(rows))  
# (array([0, 1, 3]),)

print(df_source)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 4 -1.320541  0.679631
# 6  0.833612  0.492572
# 8  1.555721  1.741279

print(df_source_train)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 6  0.833612  0.492572

print(df_source_test)
#           A         B
# 4 -1.320541  0.679631
# 8  1.555721  1.741279

这篇关于以相同方式对两个 pandas 数据帧进行采样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆