Frequency tables in pandas (like plyr in R)
Problem description
My problem is how to calculate frequencies on multiple variables in pandas. I want to go from this dataframe:
    import pandas as pd

    d1 = pd.DataFrame(
        {'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
         'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
         'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
         'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
         'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
         'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
        columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
to the following result:
                Participated  OfWhichpassed
    ExamenYear
    2007                   3              2
    2008                   4              3
    2009                   3              2
(1) One possibility I tried is to compute two dataframes and bind them:
    # rows=/cols= are the pivot_table keywords from the time of this post;
    # current pandas releases spell them index=/columns=.
    t1 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], cols=['Participated'], aggfunc=len)
    t2 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], cols=['Passed'], aggfunc=len)
    tx = pd.concat([t1, t2], axis=1)
    Res1 = tx['yes']
(2) The second possibility is to use an aggregation function:
    import collections

    dg = d1.groupby('ExamenYear')
    Res2 = dg.agg({'Participated': len,
                   'Passed': lambda x: collections.Counter(x == 'yes')[True]})
    Res2.columns = ['Participated', 'OfWhichpassed']
Both ways are awkward, to say the least. How is this done properly in pandas?
P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.
For reference: a few months ago I asked a similar question for R here, and plyr could help.
---- UPDATE ------
User DSM is right: there was a mistake in the desired table result.
(1) The code for option one is:
    # rows=/cols= are the pivot_table keywords from the time of this post;
    # current pandas releases spell them index=/columns=.
    t1 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], aggfunc=len)
    t2 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], cols=['Participated'], aggfunc=len)
    t3 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], cols=['Passed'], aggfunc=len)
    Res1 = pd.DataFrame({'All': t1,
                         'OfWhichParticipated': t2['yes'],
                         'OfWhichPassed': t3['yes']})
It will produce the result:
                All  OfWhichParticipated  OfWhichPassed
    ExamenYear
    2007          3                    2              2
    2008          4                    3              3
    2009          3                    3              2
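For readers on a current pandas release, option one could be sketched roughly as follows. This is an illustration rather than the poster's code: it assumes the `index=`/`columns=` keyword spelling, swaps the first pivot table for `groupby(...).size()`, and adds `fill_value=0` plus a final `astype(int)` so that empty year/answer cells do not turn the counts into floats. The lower-case names `all_counts` and `res1` are introduced here for the example.

```python
import pandas as pd

# Same exam data as in the question, limited to the columns used here.
d1 = pd.DataFrame({
    'StudentID': ['x1', 'x10', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'],
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# Number of rows per year (stands in for the first pivot table).
all_counts = d1.groupby('ExamenYear').size()

# One row per year, one column per answer ('no' / 'yes').
t2 = d1.pivot_table(values='StudentID', index='ExamenYear',
                    columns='Participated', aggfunc='count', fill_value=0)
t3 = d1.pivot_table(values='StudentID', index='ExamenYear',
                    columns='Passed', aggfunc='count', fill_value=0)

# Keep only the 'yes' counts and force integer dtype.
res1 = pd.DataFrame({'All': all_counts,
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']}).astype(int)
print(res1)
```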
(2) For option 2, thanks to user herrfz, I figured out how to use value_counts, and the code will be:
    Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                         'Participated': lambda x: x.value_counts()['yes'],
                                         'Passed': lambda x: x.value_counts()['yes']})
    Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']
which will produce the same result as Res1.
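One caveat with indexing `value_counts()` by `'yes'` is that it raises a `KeyError` for any group that contains no `'yes'` at all. A possible way around this, sketched here as an assumption rather than something from the thread, is to take the grouped `value_counts` and unstack it with `fill_value=0`, which yields a full frequency table per year. The name `freq` is made up for the example.

```python
import pandas as pd

# Same exam data as in the question, limited to the columns used here.
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
})

# Count each answer within each year, then pivot the answers into columns.
# fill_value=0 covers year/answer pairs that never occur (2009 has no 'no').
freq = (d1.groupby('ExamenYear')['Participated']
          .value_counts()
          .unstack(fill_value=0))
print(freq)
```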
My question remains, though:
Using option 2, is it possible to use the same variable twice (for another operation)? And can one pass a custom name for the resulting variable?
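Both points can be addressed with named aggregation, which pandas added in version 0.25, well after this question was posted. The sketch below assumes such a version; the helper functions `count_yes` and `count_not_yes` and the output column `NotPassed` are invented for illustration. Each keyword becomes an output column name, and the same input column (`Passed`) is used twice.

```python
import pandas as pd

# Same exam data as in the question, limited to the columns used here.
d1 = pd.DataFrame({
    'StudentID': ['x1', 'x10', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'],
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})


def count_yes(s):
    """Number of 'yes' entries in a Series (hypothetical helper)."""
    return (s == 'yes').sum()


def count_not_yes(s):
    """Number of entries other than 'yes' (hypothetical helper)."""
    return (s != 'yes').sum()


# keyword = (input column, aggregation); the keyword is the output name.
res2 = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'count'),
    OfWhichParticipated=('Participated', count_yes),
    OfWhichPassed=('Passed', count_yes),
    NotPassed=('Passed', count_not_yes),   # 'Passed' used a second time
)
print(res2)
```

Naming the outputs in the call also avoids reassigning `Res2.columns` by position afterwards.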
---- A NEW UPDATE ----
I have finally decided to use apply, which I understand is more flexible.
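The post does not show the apply-based code, so the following is only a guess at what such a solution might look like. It assumes a function (here called `summarize`, a made-up name) that receives each year's sub-frame and returns a Series whose labels become the output columns.

```python
import pandas as pd

# Same exam data as in the question, limited to the columns used here.
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})


def summarize(group):
    """Summarize one year's rows; the Series labels become column names."""
    return pd.Series({
        'All': len(group),
        'OfWhichParticipated': (group['Participated'] == 'yes').sum(),
        'OfWhichPassed': (group['Passed'] == 'yes').sum(),
    })


# Selecting the columns explicitly keeps the grouping key out of each group.
res3 = d1.groupby('ExamenYear')[['Participated', 'Passed']].apply(summarize)
print(res3)
```

Because the function sees the whole sub-frame, it can combine several columns in one statistic, which a per-column `agg` dictionary cannot do.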
Solution
This:
    d1.groupby('ExamenYear').agg({'Participated': len,
                                  'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.
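A vectorized variant of the same idea, offered as a sketch and not as part of the original answer, is to compare the columns with `'yes'` once and then sum the resulting booleans per year, since `True` counts as 1. The names `is_yes` and `counts` are introduced here for the example.

```python
import pandas as pd

# Same exam data as in the question, limited to the columns used here.
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# Boolean frame: True wherever the answer is 'yes'.
is_yes = d1[['Participated', 'Passed']] == 'yes'

# Summing booleans per year counts the 'yes' answers in each column.
counts = is_yes.groupby(d1['ExamenYear']).sum()

# Add the total number of rows per year as the first column.
counts.insert(0, 'All', d1.groupby('ExamenYear').size())
print(counts)
```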