pandas DataFrame数据透视使用日期和数量 [英] Pandas DataFrame Pivot Using Dates and Counts

查看:861
本文介绍了 pandas DataFrame数据透视使用日期和数量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经使用了大量的数据文件,并使用groupby和value_counts来获取下面的数据框。但是,我想格式化,所以公司在左边,几个月在上,每个号码将是这个月的电话号码,第三列。

I've taken a large data file and managed to use groupby and value_counts to get the dataframe below. However, I want to format it so the company is on the left, with the months on top, and each number would be the number of calls that month, the third column.

这是我的代码排序:

data = pd.DataFrame.from_csv('MYDATA.csv')

data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()  

count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count)
df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')

这是我的输出df =

Here is my output df=

recvd_dttm      CompanyName                           
1/1/2015 11:42  Company 1      1
1/1/2015 14:29  Company 2      1
1/1/2015 8:12   Company 4      1
1/1/2015 9:53   Company 1      1
1/10/2015 11:38 Company 3      1
1/10/2015 11:31 Company 5      1
1/10/2015 12:04 Company 2      1

我想要

Company     Jan Feb Mar Apr May
Company 1   10  4   45  40  34
Company 2   2   5   56  5   57
Company 3   3   7   71  6   53
Company 4   4   4   38  32  2
Company 5   20  3   3   3   29

我知道这个文档中的数据框有一个漂亮的枢纽功能 http://pandas.pydata.org/pandas-docs/stable/reshaping.html 为大熊猫,所以我一直在尝试使用df.pivot(index ='recvd_dttm',columns ='CompanyName' ,values ='NumberCalls')

I know that there is a nifty pivot function for dataframes from this documentation http://pandas.pydata.org/pandas-docs/stable/reshaping.html for pandas, so I've been trying to use df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')

一个问题是第三列没有名称,因此我不能将其用于values ='NumberCalls'。第二个问题是找出如何在数据帧中采用datetime格式,并按月显示。

One problem is that the third column doesn't have a name, so I can't use it for values = 'NumberCalls'. The second problem is figuring out how to take the datetime format in my dataframe and make it display by month only.

编辑:
CompanyName是第一列,recvd_dttm是第15列。这是我经过一些尝试后的代码:

CompanyName is the first column, recvd_dttm is the 15th column. This is my code after some more attempts:

data = pd.DataFrame.from_csv('MYDATA.csv')

data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()


count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count).set_index('recvd_dttm').sort_index()
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']

result.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')

它抛出这个错误:KeyError:'recvd_dttm',不会得到结果行。

It is throwing this error: KeyError: 'recvd_dttm' and won't get to the result line.

推荐答案

您需要在创建之前汇总数据启动数据透视表。如果没有列名,可以将它引用到 df.iloc [:, 1] (第二列),或者简单地重命名df。

You need to aggregate the data before creating the pivot table. If there is no column name, you can either refer it to df.iloc[:, 1] (the 2nd column) or simply rename the df.

import pandas as pd
import numpy as np

# just simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1
df.columns = ['CompanyName', '']

Out[34]: 
                    CompnayName   
recvd_dttm                        
2015-01-01 00:00:00    company2  1
2015-01-01 00:00:00    company2  1
2015-01-01 00:00:00    company1  1
2015-01-01 00:00:00    company2  1
2015-01-01 01:00:00    company4  1
2015-01-01 01:00:00    company2  1
2015-01-01 01:00:00    company5  1
2015-01-01 03:00:00    company3  1
2015-01-01 03:00:00    company2  1
2015-01-01 03:00:00    company3  1
2015-01-01 04:00:00    company4  1
2015-01-01 04:00:00    company1  1
2015-01-01 04:00:00    company3  1
2015-01-01 05:00:00    company2  1
2015-01-01 06:00:00    company5  1
...                         ... ..
2015-06-29 19:00:00    company2  1
2015-06-29 19:00:00    company2  1
2015-06-29 19:00:00    company3  1
2015-06-29 19:00:00    company3  1
2015-06-29 19:00:00    company5  1
2015-06-29 19:00:00    company5  1
2015-06-29 20:00:00    company1  1
2015-06-29 20:00:00    company4  1
2015-06-29 22:00:00    company1  1
2015-06-29 22:00:00    company2  1
2015-06-29 22:00:00    company4  1
2015-06-30 00:00:00    company1  1
2015-06-30 00:00:00    company2  1
2015-06-30 00:00:00    company1  1
2015-06-30 00:00:00    company4  1

[10000 rows x 2 columns]

# first groupby month and company name, and calculate the sum of calls, and reset all index
# since we don't have a name for that columns, simply tell pandas it is the 2nd column we try to count on
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
# rename the columns
result.columns = ['Month', 'CompanyName', 'counts']

Out[41]: 
    Month CompanyName  counts
0       1    company1     328
1       1    company2     337
2       1    company3     342
3       1    company4     345
4       1    company5     331
5       2    company1     295
6       2    company2     300
7       2    company3     328
8       2    company4     304
9       2    company5     329
10      3    company1     366
11      3    company2     398
12      3    company3     339
13      3    company4     336
14      3    company5     345
15      4    company1     322
16      4    company2     348
17      4    company3     351
18      4    company4     340
19      4    company5     312
20      5    company1     347
21      5    company2     354
22      5    company3     347
23      5    company4     363
24      5    company5     312
25      6    company1     316
26      6    company2     311
27      6    company3     331
28      6    company4     307
29      6    company5     316

# create pivot table
result.pivot(index='CompanyName', columns='Month', values='counts')

Out[44]: 
Month          1    2    3    4    5    6
CompanyName                              
company1     326  297  339  337  344  308
company2     310  318  342  328  355  296
company3     347  315  350  343  347  329
company4     339  314  367  353  343  311
company5     370  331  370  320  357  294

这篇关于 pandas DataFrame数据透视使用日期和数量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆