Shuffle DataFrame rows
Question
I have the following DataFrame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
...
20 7 8 9 2
21 10 11 12 2
...
45 13 14 15 3
46 16 17 18 3
...
The DataFrame is read from a CSV file. All rows with Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:
Col1 Col2 Col3 Type
0 7 8 9 2
1 13 14 15 3
...
20 1 2 3 1
21 10 11 12 2
...
45 4 5 6 1
46 16 17 18 3
...
How can I achieve this?
Recommended answer
The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
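As a small illustration of the point above, the sketch below shuffles a toy frame and checks that no rows are lost. The helper name shuffle_rows and the sample data are made up for this example; random_state is an optional argument of DataFrame.sample that makes the shuffle reproducible and is not part of the original answer.

```python
import pandas as pd


def shuffle_rows(df, seed=None):
    """Return a new DataFrame with the rows of df in random order.

    shuffle_rows is a hypothetical helper for this example; it only
    wraps DataFrame.sample(frac=1).
    """
    # frac=1 samples 100% of the rows; sampling is without replacement
    # by default, so every row appears exactly once.
    return df.sample(frac=1, random_state=seed)


# Toy data shaped like the frame in the question (values are illustrative).
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

shuffled = shuffle_rows(df, seed=0)
print(shuffled)

# Sorting by the original index labels recovers the original frame,
# which shows the shuffle only reordered rows.
print(shuffled.sort_index().equals(df))
```

Note that the shuffled frame keeps the original index labels, which is why sorting by index restores the original order.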
Note: If you wish to shuffle your dataframe and reset the index, assigning the result back to the same name, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
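To make the effect of drop=True concrete, here is a minimal sketch comparing the two calls on a toy frame. The data and variable names are illustrative only.

```python
import pandas as pd

# Toy frame with the default 0..n-1 index (values are illustrative).
df = pd.DataFrame({"Col1": [1, 4, 7], "Type": [1, 1, 2]})
shuffled = df.sample(frac=1, random_state=1)

# Without drop=True the old index labels are kept as a new column,
# named "index" because the original index has no name.
with_old_index = shuffled.reset_index()

# With drop=True the old labels are discarded.
without_old_index = shuffled.reset_index(drop=True)

print(list(with_old_index.columns))     # ['index', 'Col1', 'Type']
print(list(without_old_index.columns))  # ['Col1', 'Type']
```

In both cases the resulting frame gets a fresh 0, 1, 2, ... index; the only difference is whether the old labels survive as a column.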
Follow-up note: The operation above is not in-place in the strict sense: df.sample returns a new object, so id(df_old) is not the same as id(df_new). Rebinding the name df, however, lets the original frame be released, so the process does not end up holding two copies. A simple line-by-line memory profile shows almost no net increase after the shuffle:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
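A line-by-line profile reports memory sampled between lines, so it may not reveal a short-lived copy made while a line is running. One way to look at both the retained and the peak memory is the standard-library tracemalloc module, sketched below. This is a different tool from the memory_profiler run above, not the method used in the answer; the function name measure_shuffle and the reduced array size are assumptions made for this example.

```python
import tracemalloc

import numpy as np
import pandas as pd


def measure_shuffle(n_rows=100, n_cols=10_000):
    """Return (before, after, peak) traced bytes around the shuffle.

    measure_shuffle is a hypothetical helper; the default size is far
    smaller than the 100 x 1000000 frame in the profile above.
    """
    tracemalloc.start()
    df = pd.DataFrame(np.random.randn(n_rows, n_cols))
    before, _ = tracemalloc.get_traced_memory()

    # Same statement as in the answer: shuffle, then rebind the name df.
    df = df.sample(frac=1).reset_index(drop=True)

    # "after" is what is still allocated once the line has finished;
    # "peak" is the highest traced usage since tracemalloc.start().
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return before, after, peak


before, after, peak = measure_shuffle()
mib = 1024 * 1024
print(f"before: {before / mib:.1f} MiB")
print(f"after:  {after / mib:.1f} MiB")
print(f"peak:   {peak / mib:.1f} MiB")
```

Comparing "after" with "before" shows what the process retains once df has been rebound, while "peak" indicates how much was needed at the high-water mark. The exact figures depend on the pandas and NumPy versions in use.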