How to randomly remove rows from a dataframe, but from each label?
Question
This is for a machine learning project.
I have a dataframe with 5 feature columns and 1 label column (Figure A).
I want to randomly remove 2 rows from each label. Since there are 12 rows (4 per label), I should end up with 6 rows (2 per label) (Figure B).
How can I do this? Would it be easier with only numpy?
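[Figure A: the original dataframe, 12 rows, 4 per label]
[Figure B: the desired dataframe, 6 rows, 2 per label]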
Here is my code:
# THIS IS FOR FIGURE A
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
df.index = ['s1', 's1', 's1', 's1', 's2', 's2', 's2', 's2', 's3', 's3', 's3', 's3']
df
# THIS IS MY ATTEMPT FOR FIGURE B
# Note: this samples 2 rows from the whole dataframe, not 2 per label
dfs = df.sample(n=2)
dfs
Recommended answer
With groupby.apply:
df.groupby('label', as_index=False).apply(lambda x: x.sample(2)) \
.reset_index(level=0, drop=True)
Out:
           0         1         2         3         4  label
s1  0.433731  0.886622  0.683993  0.125918  0.398787      1
s1  0.719834  0.435971  0.935742  0.885779  0.460693      1
s2  0.324877  0.962413  0.366274  0.980935  0.487806      2
s2  0.600318  0.633574  0.453003  0.291159  0.223662      2
s3  0.741116  0.167992  0.513374  0.485132  0.550467      3
s3  0.301959  0.843531  0.654343  0.726779  0.594402      3
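On pandas 1.1 or later, the same per-label sampling can probably be written without apply, using the groupby object's own sample method. The following is a minimal sketch rather than part of the original answer: the seed values and the name kept are illustrative assumptions, and the frame is rebuilt with a seeded generator only so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df['label'] = np.repeat([1, 2, 3], 4)
df.index = np.repeat(['s1', 's2', 's3'], 4)

# Keep 2 random rows per label, which is the same as dropping 2 of the 4 rows in each group.
# random_state is optional; it only makes the selection repeatable.
kept = df.groupby('label').sample(n=2, random_state=0)
print(kept)
```

Sampling the rows to keep, instead of dropping the rows to remove, sidesteps the duplicated index labels (s1, s2, s3): dropping by label would remove every row sharing that label.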
A cleaner way, in my opinion, is a comprehension (here, a generator expression passed to pd.concat):
pd.concat(g.sample(2) for idx, g in df.groupby('label'))
This produces the same kind of result (the sampled rows differ because the selection is random):
           0         1         2         3         4  label
s1  0.442293  0.470318  0.559764  0.829743  0.146971      1
s1  0.603235  0.218269  0.516422  0.295342  0.466475      1
s2  0.569428  0.109494  0.035729  0.548579  0.760698      2
s2  0.600318  0.633574  0.453003  0.291159  0.223662      2
s3  0.412750  0.079504  0.433272  0.136108  0.740311      3
s3  0.462627  0.025328  0.245863  0.931857  0.576927      3
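As for the numpy-only part of the question, one possible approach is to pick row positions per label and index the arrays with them. This is a sketch under assumptions, not something from the original answer: the names features, labels, keep, features_kept and labels_kept are hypothetical, and the features are assumed to live in a plain 2-D array alongside a 1-D label array.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, used only for repeatability
features = rng.random((12, 5))
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

# For each distinct label, choose 2 of its row positions without replacement.
keep = np.concatenate([
    rng.choice(np.flatnonzero(labels == lab), size=2, replace=False)
    for lab in np.unique(labels)
])
keep.sort()  # restore the original row order

features_kept = features[keep]
labels_kept = labels[keep]
print(labels_kept)
```

Whether this is easier than the pandas version is a matter of taste; it avoids the index handling but needs the per-label loop written out by hand.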