Simple cross-tabulation in pandas


Question

I stumbled across pandas, and it looks ideal for the simple calculations I'd like to do. I have a SAS background and was thinking of using it to replace proc freq; it also looks like it will scale to what I may want to do in the future. However, I can't get my head around a simple task: I'm not sure whether I should be looking at pivot, crosstab, or indexing, or whether I should be using a Panel or DataFrames. Could someone give me some pointers on how to do the following?

I have two CSV files of simple transactional data, one for 2010 and one for 2011. The columns are category and amount.

2010:

AB,100.00
AB,200.00
AC,150.00
AD,500.00

2011:

AB,500.00
AC,250.00
AX,900.00

These are loaded into separate DataFrame objects.

What I'd like to get is each category, the sum of its amounts, and its frequency (row count), e.g.:

2010:

AB,300.00,2
AC,150.00,1
AD,500.00,1

2011:

AB,500.00,1
AC,250.00,1
AX,900.00,1

I can't work out whether I should be using pivot, crosstab, groupby, an index, or something else. I can get either the sum or the frequency, but not both at once. It gets a bit more complex because I would like to do this on a month-by-month basis, but if someone could point me to the right technique, I should be able to go from there.

Recommended answer

Assuming that you have a file called 2010.csv with the contents

category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00

Then, because a groupby can be followed by multiple aggregation functions, you can say:

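import pandas
# Load the 2010 transactions; the header row supplies the column names.
data_2010 = pandas.read_csv("/path/to/2010.csv")
# For each category, compute the row count (len) and the total (sum) of the value column.
data_2010.groupby("category").agg([len, sum])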

You should get a result that looks something like:

          value     
            len  sum
category            
AB            2  300
AC            1  150
AD            1  500
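For readers who want to try this without a file on disk, here is a minimal, self-contained sketch of the same groupby-then-aggregate idea. It is not the answer's exact code: the CSV text is inlined through io.StringIO, the aggregations are given as the strings "sum" and "size" rather than the built-in functions, and the output column names total and count are chosen purely for illustration.

```python
import io

import pandas as pd

# Inline copy of the 2010.csv contents shown above (assumption: same header row).
csv_2010 = """category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
"""

data_2010 = pd.read_csv(io.StringIO(csv_2010))

# Named aggregation: each keyword becomes an output column,
# defined as (input column, aggregation name).
# "total" and "count" are illustrative names, not from the original answer.
summary = data_2010.groupby("category").agg(
    total=("value", "sum"),   # sum of amounts per category
    count=("value", "size"),  # number of rows per category
)

print(summary)
```

Named aggregation yields flat column names, which may be easier to write back to CSV than the two-level value/len/sum header shown above.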

Note that Wes (Wes McKinney, the author of pandas) will likely come by to point out that sum is optimized and that you should probably use np.sum.
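The same pattern can plausibly be extended to the two yearly files from the question. The sketch below is one possible approach, not part of the original answer: it assumes the files have no header row (as the question's samples suggest), inlines their contents, and stacks them with pd.concat so that year becomes an extra grouping key. The names frames, combined, and by_year are illustrative.

```python
import io

import pandas as pd

# Inline copies of the question's sample data (assumption: no header row).
csv_by_year = {
    2010: "AB,100.00\nAB,200.00\nAC,150.00\nAD,500.00\n",
    2011: "AB,500.00\nAC,250.00\nAX,900.00\n",
}

# One DataFrame per year, with explicit column names.
frames = {
    year: pd.read_csv(io.StringIO(text), header=None, names=["category", "amount"])
    for year, text in csv_by_year.items()
}

# Stack the frames; the dict keys become an index level named "year",
# which is then moved into an ordinary column.
combined = pd.concat(frames, names=["year", "row"]).reset_index(level="year")

# Sum and row count of amount for every (year, category) pair.
by_year = combined.groupby(["year", "category"])["amount"].agg(["sum", "size"])

print(by_year)
```

If the data also carried a date column (none is shown in the question), a month key derived from it could be added to the list of grouping columns in the same way.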
