Why use to_frame before reset_index?
Question
With a dataset like this:
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
but we get the same result from:
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because reset_index on Series includes a name parameter and returns a data frame, while reset_index on DataFrame does not include the name parameter.)
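To illustrate the difference in signatures noted above, here is a minimal sketch. The small fixed dataset is an assumption made for reproducibility; it is not the random data from the question.

```python
import pandas as pd

# Hypothetical fixed data, standing in for the random data in the question
df = pd.DataFrame({"user_id": [0, 0, 1, 2, 2, 2],
                   "module_id": [3, 4, 1, 2, 2, 0]})

# groupby(...).count() on a single column gives a Series indexed by user_id
counts = df.groupby("user_id")["module_id"].count()

# Series.reset_index accepts `name` and returns a DataFrame
from_series = counts.reset_index(name="count")
print(list(from_series.columns))

# DataFrame.reset_index has no `name` keyword, so the call is rejected
try:
    counts.to_frame().reset_index(name="count")
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

This is why the to_frame chain needs a separate rename step.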
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in a commit on the 27th of January 2012.
Series.to_frame was added in a commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
Recommended answer
There is no noticeable advantage in using to_frame(). Both approaches achieve the same result, and pandas commonly offers several ways to solve the same problem. The only advantage I can think of is that, for larger sets of data, it may be more convenient to have a dataframe view before resetting the index. Taking your dataframe as an example, to_frame() displays a dataframe view, which may be useful for understanding the data as a neat dataframe table versus a count series. Also, using to_frame() makes the intent clearer to a new user who looks at your code for the first time.
Example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() method returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a DataFrame. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
And now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, so we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
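As an aside that is not part of the original answer: Series.to_frame itself accepts a name argument, so the rename step can be folded into the conversion. A minimal sketch, assuming a small fixed dataset in place of the random one:

```python
import pandas as pd

# Hypothetical fixed data, standing in for the random data in the question
df = pd.DataFrame({"user_id": [0, 0, 1, 2, 2, 2],
                   "module_id": [3, 4, 1, 2, 2, 0]})
test1 = df.groupby("user_id")["module_id"].count()

# Name the column while converting, so no rename step is needed afterwards
named = test1.to_frame(name="count").reset_index()

# The three-step chain from the answer, for comparison
chained = (test1.to_frame()
                .reset_index()
                .rename({"module_id": "count"}, axis="columns"))

# True when values, dtypes, index and column labels all match
print(named.equals(chained))
```

This keeps the explicit to_frame step while dropping the separate rename call.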
Now let's look at the other case. We will build the same count() series as test1 but call it test2 to differentiate between the two approaches. In other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see, both dataframes are equivalent. In the second approach we only had to use reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
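The element-wise == comparison above yields a frame of booleans that has to be read by eye. A stricter check is possible with pd.testing.assert_frame_equal or DataFrame.equals. The sketch below assumes a small fixed dataset rather than the random one, and reuses the names testdf1 and testdf2 only for illustration:

```python
import pandas as pd

# Hypothetical fixed data, standing in for the random data in the question
df = pd.DataFrame({"user_id": [0, 0, 1, 2, 2, 2],
                   "module_id": [3, 4, 1, 2, 2, 0]})
counts = df.groupby("user_id")["module_id"].count()

# Approach 1: to_frame, then reset_index, then rename
testdf1 = counts.to_frame().reset_index().rename(columns={"module_id": "count"})

# Approach 2: reset_index with the name parameter
testdf2 = counts.reset_index(name="count")

# Raises AssertionError if values, dtypes, column labels or index differ
pd.testing.assert_frame_equal(testdf1, testdf2)

# A single boolean summary of the same comparison
same = testdf1.equals(testdf2)
print(same)
```

Unlike ==, these checks also compare dtypes and labels, so they catch differences that a frame full of True values could hide.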
The second case has less code but is less readable to new eyes, and I'd prefer the first approach of using to_frame() because it makes the intent clear: "Convert this count series to a dataframe and rename the column 'module_id' to 'count'".