Shuffle DataFrame rows


Problem description

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a CSV file. All rows with Type 1 come first, followed by the rows with Type 2, then the rows with Type 3, and so on.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

Recommended answer

The idiomatic way to do this with pandas is to use the .sample method of your DataFrame to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
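
As a minimal sketch of the call above, the following builds a small stand-in for the question's DataFrame (the values and the random_state seed are illustrative assumptions, not the asker's data) and checks that frac=1 returns every row exactly once, only in a different order. random_state is an optional parameter of .sample, used here only so that the run is repeatable.

```python
import pandas as pd

# Small stand-in for the DataFrame in the question (values are illustrative).
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle.
# random_state is optional; it is fixed here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=0)

print(shuffled)

# .sample keeps each row's original index label, so sorting by the index
# puts the rows back in their original order.
restored = shuffled.sort_index()
print(restored["Type"].tolist())
```

Note that the shuffled frame still carries the old index labels, which is why the original order can be recovered with sort_index().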

Note: If you wish to shuffle your DataFrame "in place" (by assigning the result back to the same name) and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
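
To make the effect of drop=True concrete, here is a small sketch that compares the two forms of .reset_index on a shuffled frame. The data and the random_state seed are illustrative assumptions.

```python
import pandas as pd

# Illustrative data, not the DataFrame from the question.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})
shuffled = df.sample(frac=1, random_state=1)

# Without drop=True, the old index is kept as a new column named "index".
kept = shuffled.reset_index()

# With drop=True, the old index is discarded and rows are renumbered 0..n-1.
dropped = shuffled.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
print(list(dropped.index))
```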

Follow-up note: Although the above operation may not look in-place, Python/pandas appears not to keep a second allocation for the shuffled object once the assignment completes. That is, even though the reference has changed (id(df_old) is not the same as id(df_new)), memory usage after the line is essentially unchanged. Keep in mind that a line-by-line profile reports memory after each line has run, so it shows that there is no lasting increase rather than proving that no temporary copy was made. You can check this with a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
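
If memory_profiler is not installed, a similar measurement can be sketched with the standard-library tracemalloc module. This is a swapped-in alternative, not the method used for the profile above: the function name shuffle_and_measure and the smaller array size are assumptions for illustration, and it relies on NumPy reporting its array allocations to tracemalloc. The numbers depend on the pandas and NumPy versions, so no expected output is shown.

```python
import tracemalloc

import numpy as np
import pandas as pd


def shuffle_and_measure(n_rows=100, n_cols=1000):
    """Hypothetical helper: shuffle a random frame and report traced memory.

    Returns the shuffled frame and the traced bytes before the shuffle,
    after the shuffle, and at the overall peak.
    """
    tracemalloc.start()
    df = pd.DataFrame(np.random.randn(n_rows, n_cols))
    before, _ = tracemalloc.get_traced_memory()

    # Same statement as in the profiled script: rebind df to the shuffled copy.
    df = df.sample(frac=1).reset_index(drop=True)

    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return df, before, after, peak


df, before, after, peak = shuffle_and_measure()
mib = 2 ** 20
print(f"traced before shuffle: {before / mib:.2f} MiB")
print(f"traced after shuffle:  {after / mib:.2f} MiB")
print(f"traced peak:           {peak / mib:.2f} MiB")
```

Comparing the "after" value with the peak gives a rough idea of whether a temporary copy existed while the statement was running, which a line-by-line profile cannot show.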
