Why use to_frame before reset_index?


Question

With a dataset like this:

df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])

We often see this pattern:

df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')

But we get exactly the same result from:

df.groupby(['user_id'])['module_id'].count().reset_index(name='count')

(N.B. we need the additional rename in the former because Series.reset_index includes a name parameter and returns a DataFrame, while DataFrame.reset_index does not include a name parameter.)
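To make the comparison concrete, here is a minimal sketch that runs both chains on the same grouped Series and checks that the results match. The fixed seed and the variable names (rng, counts, long_form, short_form) are assumptions added for reproducibility and are not part of the question.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible (an assumption, not from the question).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(20, 3)),
                  columns=['user_id', 'module_id', 'week'])

# A Series named 'module_id', indexed by 'user_id'.
counts = df.groupby(['user_id'])['module_id'].count()

# Pattern 1: Series -> DataFrame -> reset index -> rename the value column.
long_form = (counts.to_frame()
                   .reset_index()
                   .rename({'module_id': 'count'}, axis='columns'))

# Pattern 2: Series.reset_index names the value column itself.
short_form = counts.reset_index(name='count')

print(long_form.equals(short_form))
```

Both results have the columns user_id and count, so the choice between the two chains is about style rather than output.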

Is there any advantage in using to_frame first?

(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:

  • Series.reset_index was added in this commit on the 27th of January 2012.
  • Series.to_frame was added in this commit on the 13th of October 2013.

So Series.reset_index was available over a year before Series.to_frame.)

Recommended answer

There is no noticeable advantage to using to_frame(). Both approaches produce the same result, and pandas commonly offers several ways to solve the same problem. The only advantage I can think of is that, for larger sets of data, it may be more convenient to have a DataFrame view first before resetting the index. Taking your dataframe as an example, to_frame() displays a DataFrame view, which may be useful for understanding the data as a neat table rather than as a count Series. Also, using to_frame() makes the intent clearer to a new user reading your code for the first time.

Example dataframe:

In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])

In [8]: df.head()
Out[8]:
   user_id  module_id  week
0        3          4     4
1        1          3     4
2        1          2     2
3        1          3     4
4        1          2     2

The count() function returns a Series:

In [18]: test1 = df.groupby(['user_id'])['module_id'].count()

In [19]: type(test1)
Out[19]: pandas.core.series.Series

In [20]: test1
Out[20]:
user_id
0    2
1    7
2    4
3    6
4    1
Name: module_id, dtype: int64

In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')

Using to_frame makes it explicit that you intend to convert the Series to a DataFrame. The index here is user_id:

In [22]: test1.to_frame()
Out[22]:
         module_id
user_id
0                2
1                7
2                4
3                6
4                1

Now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, so we have to rename the column explicitly.

In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')

In [25]: testdf1
Out[25]:
   user_id  count
0        0      2
1        1      7
2        2      4
3        3      6
4        4      1
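As a side note, Series.to_frame() itself accepts a name argument, so the separate rename step can be folded into the conversion. The sketch below is an assumption-laden illustration: test1 is rebuilt by hand from the values printed in the session above, and testdf1_alt is a hypothetical name.

```python
import pandas as pd

# Hand-built stand-in for the test1 Series shown above (values copied from Out[20]).
test1 = pd.Series([2, 7, 4, 6, 1],
                  index=pd.Index([0, 1, 2, 3, 4], name='user_id'),
                  name='module_id')

# to_frame(name=...) sets the column name during conversion,
# so no rename is needed after reset_index().
testdf1_alt = test1.to_frame(name='count').reset_index()

print(testdf1_alt)
```

This keeps the explicit "convert to a DataFrame" step while avoiding the extra rename call.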

Now let's look at the other case. We compute the same count() Series as test1 but call it test2 to differentiate between the two approaches. In other words, test1 is equal to test2.

In [26]: test2 = df.groupby(['user_id'])['module_id'].count()

In [27]: test2
Out[27]:
user_id
0    2
1    7
2    4
3    6
4    1
Name: module_id, dtype: int64

In [28]: test2.reset_index()
Out[28]:
   user_id  module_id
0        0          2
1        1          7
2        2          4
3        3          6
4        4          1

In [30]: testdf2 = test2.reset_index(name='count')

In [31]: testdf1 == testdf2
Out[31]:
   user_id  count
0     True   True
1     True   True
2     True   True
3     True   True
4     True   True

As you can see, both dataframes are equivalent. In the second approach we only had to call reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
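The elementwise == comparison prints a full boolean table, which becomes hard to scan on larger frames. A more compact check, sketched here under the assumption that the Series is rebuilt by hand from the values printed earlier, collapses the comparison to a single answer using DataFrame.equals or pandas.testing.assert_frame_equal.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hand-built stand-in for the count Series from the session (values copied from the output).
counts = pd.Series([2, 7, 4, 6, 1],
                   index=pd.Index([0, 1, 2, 3, 4], name='user_id'),
                   name='module_id')

testdf1 = counts.to_frame().reset_index().rename({'module_id': 'count'}, axis='columns')
testdf2 = counts.reset_index(name='count')

# Reduce the elementwise comparison to a single boolean.
print((testdf1 == testdf2).all().all())

# equals() also compares dtypes and returns one boolean.
print(testdf1.equals(testdf2))

# Raises AssertionError with a readable diff if the frames differ; silent otherwise.
assert_frame_equal(testdf1, testdf2)
```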

The second case needs less code but is less readable to new eyes. I'd prefer the first approach of using to_frame() because it makes the intent clear: "Convert this count Series to a DataFrame and rename the column 'module_id' to 'count'".
