Simple cross-tabulation in pandas
Question
I stumbled across pandas and it looks ideal for the simple calculations I'd like to do. I have a SAS background and was thinking it could replace proc freq; it also looks like it will scale to what I may want to do in the future. However, I can't seem to get my head around a simple task. I'm not sure whether I should be looking at pivot/crosstab/indexing, or whether I should have a Panel or DataFrames, etc. Could someone give me some pointers on how to do the following?
I have two CSV files of simple transactional data, one for 2010 and one for 2011. The columns are category and amount:
2010:
AB,100.00
AB,200.00
AC,150.00
AD,500.00
2011:
AB,500.00
AC,250.00
AX,900.00
These are loaded into separate DataFrame objects.
What I'd like to get is the category, the sum of the category, and the frequency of the category, e.g.:
2010:
AB,300.00,2
AC,150.00,1
AD,500.00,1
2011:
AB,500.00,1
AC,250.00,1
AX,900.00,1
I can't work out whether I should be using pivot/crosstab/groupby/an index, etc. I can get either the sum or the frequency, but I can't seem to get both. It gets a bit more complex because I would like to do it on a month-by-month basis, but if someone would be so kind as to point me to the right technique/direction, I'll be able to go from there.
Recommended answer
Assuming that you have a file called 2010.csv with contents
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
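import pandas
# Load the 2010 transactions (columns: category, value)
data_2010 = pandas.read_csv("/path/to/2010.csv")
# Per category: len gives the row count (frequency), sum gives the total
data_2010.groupby("category").agg([len, sum])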
You should get something like
          value
            len  sum
category
AB            2  300
AC            1  150
AD            1  500
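If flat columns in the order asked for in the question (category, sum, frequency) are preferred over the two-level header, one possible approach is named aggregation, which is available in newer pandas releases. The sketch below is only an illustration: it inlines the sample data with `io.StringIO` instead of reading the file, and the output column names `total` and `count` are made up for this example.

```python
import io
import pandas as pd

# Sample data from the answer, inlined so the sketch runs without a file.
csv_2010 = io.StringIO(
    "category,value\nAB,100.00\nAB,200.00\nAC,150.00\nAD,500.00\n"
)
data_2010 = pd.read_csv(csv_2010)

# Named aggregation: each keyword becomes an output column, defined as
# (source column, aggregation name). "total" and "count" are hypothetical
# names chosen for this sketch.
summary = (
    data_2010.groupby("category")
    .agg(total=("value", "sum"), count=("value", "count"))
    .reset_index()
)

# Write the rows in the category,sum,frequency layout from the question.
print(summary.to_csv(index=False, header=False))
```

Note that `"count"` counts non-null values in the column; if rows with missing amounts should also be counted, `"size"` may be the better choice.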
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
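On that point, another option in current pandas is to pass the aggregation names as strings ("count", "sum"), which pandas maps to its own groupby reductions. The sketch below applies this to both years from the question at once. It assumes the question's files have no header row, so the column names `category` and `value` and the added `year` column are choices made for this example, and the data is inlined rather than read from disk.

```python
import io
import pandas as pd

# Sample data from the question, keyed by year. The rows have no header,
# so column names are supplied when reading (an assumption of this sketch).
raw = {
    2010: "AB,100.00\nAB,200.00\nAC,150.00\nAD,500.00\n",
    2011: "AB,500.00\nAC,250.00\nAX,900.00\n",
}

frames = []
for year, text in raw.items():
    df = pd.read_csv(io.StringIO(text), header=None, names=["category", "value"])
    df["year"] = year  # tag each row with the year it came from
    frames.append(df)

# Stack both years into a single frame.
combined = pd.concat(frames, ignore_index=True)

# String aliases select pandas' built-in reductions for each group.
by_year = combined.groupby(["year", "category"])["value"].agg(["count", "sum"])
print(by_year)
```

A month-by-month breakdown would likely follow the same pattern, with a month key added to the groupby list, provided the data carries a date column (the sample data shown here does not).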