Randomly selecting rows from a dataframe column

Question

For a given dataframe column, I would like to randomly select roughly 60% of the values and put them in a new column, put the remaining 40% in another column, multiply the 40% column by -1, and then create a new column that merges the two back together, like so:

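import numpy as np
import pandas as pd

# Starting point: a single column
dict0 = {'x1': [1, 2, 3, 4, 5, 6]}
data = pd.DataFrame(dict0)

# Step 1: roughly 60% of x1 goes to x2, the remaining 40% to x3
dict1 = {'x1': [1, 2, 3, 4, 5, 6], 'x2': [1, np.nan, 3, np.nan, 5, 6], 'x3': [np.nan, 2, np.nan, 4, np.nan, np.nan]}
data = pd.DataFrame(dict1)

# Step 2: multiply x3 by -1
dict2 = {'x1': [1, 2, 3, 4, 5, 6], 'x2': [1, np.nan, 3, np.nan, 5, 6], 'x3': [np.nan, -2, np.nan, -4, np.nan, np.nan]}
data = pd.DataFrame(dict2)

# Step 3: merge x2 and x3 into x4
dict3 = {'x1': [1, 2, 3, 4, 5, 6], 'x2': [1, np.nan, 3, np.nan, 5, 6], 'x3': [np.nan, -2, np.nan, -4, np.nan, np.nan], 'x4': [1, -2, 3, -4, 5, 6]}
data = pd.DataFrame(dict3)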


Recommended answer

While the first answer proposes an elegant solution, it stretches the stated requirement of selecting roughly 60% of the rows: it does not guarantee a 60/40 split. Because it relies on probabilities, the drawn values could by chance be all 1 or all -1, in effect selecting all or none of the rows rather than roughly 60%.

The chance of this happening decreases as the dataframe grows, but it is never zero, and it is readily visible when you try it on the example data from the question.
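
To put a number on that risk, here is a minimal sketch. The first answer is not reproduced on this page, so the per-row draw below (a sign picked independently for each row with probabilities 0.6 and 0.4) is an assumption about how it worked, and the names `n_rows`, `p_keep` and `share_exact` are made up for illustration.

```python
import numpy as np

n_rows = 6      # size of the example dataframe in the question
p_keep = 0.6    # assumed probability that a row keeps its sign

# Probability that every row gets the same sign (all +1 or all -1)
p_degenerate = p_keep ** n_rows + (1 - p_keep) ** n_rows

# Simulate the assumed per-row draw many times and count how often
# exactly int(0.4 * 6) = 2 rows end up negated
rng = np.random.default_rng(0)
signs = rng.choice([1, -1], size=(10_000, n_rows), p=[p_keep, 1 - p_keep])
n_negated = (signs == -1).sum(axis=1)
share_exact = (n_negated == int(0.4 * n_rows)).mean()

print(f"P(all rows same sign) = {p_degenerate:.4f}")
print(f"share of runs with exactly 2 negated rows = {share_exact:.3f}")
```

For six rows the first figure is 0.6^6 + 0.4^6, about 0.0508, so under this assumption roughly one run in twenty flips either every row or none.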

If this matters to you, take a look at this code, which guarantees a 60/40 split of the rows (up to rounding to a whole number of rows):

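import numpy as np
# Negate x1 on exactly 40% of the rows, picked without replacement (assumes the default 0..n-1 index)
indices = np.random.choice(len(data), size=int(0.4 * len(data)), replace=False)
data['x4'] = np.where(data.index.isin(indices), -1 * data['x1'], data['x1'])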
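
The question also asked for the intermediate columns. As a rough sketch, the same exact split can produce `x2`, `x3` and `x4` together. It uses a positional boolean mask instead of `data.index.isin`, so it does not rely on the dataframe having the default 0..n-1 index. The name `flip` and the seeded generator are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

rng = np.random.default_rng(42)  # seeded only to make the sketch repeatable
n_flip = int(0.4 * len(data))    # 40% of the rows, rounded down

# Positional boolean mask: True for the rows whose sign gets flipped
flip = np.zeros(len(data), dtype=bool)
flip[rng.choice(len(data), size=n_flip, replace=False)] = True

data['x2'] = data['x1'].where(~flip)        # rows kept as-is, NaN elsewhere
data['x3'] = (-data['x1']).where(flip)      # rows negated, NaN elsewhere
data['x4'] = data['x2'].fillna(data['x3'])  # merged back into one column
print(data)
```

Because `x2` and `x3` contain NaN, pandas stores them (and `x4`) as floats; cast `x4` back with `astype(int)` if integer values are needed.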

Update: one answer to your follow-up question proposes df.sample. Indeed, it lets you express the above much more elegantly:

indices = data.sample(frac=0.4).index
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])
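
One practical point: `data.sample` returns index labels, so this version should also work when the index is not 0..n-1 (as long as the labels are unique), and passing `random_state` makes the draw repeatable. A small sketch, where the string index, the name `flipped` and the seed value are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a non-default (string) index
data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]}, index=list('abcdef'))

# Sample 40% of the rows; random_state fixes the draw for repeatability
flipped = data.sample(frac=0.4, random_state=0).index

# Negate x1 on the sampled rows and keep it unchanged elsewhere
data['x4'] = np.where(data.index.isin(flipped), -data['x1'], data['x1'])
print(data)
```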
