pandas pivot table of sales

Problem description

I have a list like below:

    saleid                              upc
0   155_02127453_20090616_135212_0021   02317639000000
1   155_02127453_20090616_135212_0021   00000000000888
2   155_01605733_20090616_135221_0016   00264850000000
3   155_01072401_20090616_135224_0010   02316877000000
4   155_01072401_20090616_135224_0010   05051969277205

Each row pairs one customer's sale (saleid) with one item he/she got (the item's upc).

What I want is to pivot this table to a form like below:

                                   02317639000000 00000000000888 00264850000000 02316877000000
155_02127453_20090616_135212_0021               1              1              0              0
155_01605733_20090616_135221_0016               0              0              1              0
155_01072401_20090616_135224_0010               0              0              0              1

So, columns are unique UPCs and rows are unique SALEIDs.

I read it like this:

tbl = pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str})
tbl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid    object
upc       object
dtypes: object(2)
memory usage: 283.4+ MB

I have tried a few things, but none of them gives the correct result:

tbl.pivot_table(columns=['upc'],aggfunc=pd.Series.nunique)
upc 00000000000000  00000000000109  00000000000116  00000000000123  00000000000130  00000000000147  00000000000154  00000000000161  00000000000178  00000000000185  ...
saleid  44950   287 26180   4881    1839    623 3347    7

EDIT: I'm using a variation of the solution, shown below:

chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str}, chunksize=chunksize):
    print(f)
    t = pd.crosstab(chunk.saleid, chunk.upc)
    t.head(3)
    t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',header=True,sep=';',compression='bz2')
    f = f+1

The original file is too big to fit in memory after conversion. The solution above has a problem: because I read the original file in chunks, each output file contains only the UPC columns that occur in its chunk, so the files do not all have the same columns.

Question 2: Is there a way to force all chunks to have the same columns?

Solution

Option 1

df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)

Option 2

pd.crosstab(df.saleid, df.upc)
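
For Question 2, one possible approach (a sketch, not part of the original answer) is to read the file twice: a first pass collects every distinct UPC, and a second pass builds the crosstab for each chunk and calls reindex with that full column list, filling missing columns with 0. The in-memory csv_text below is a tiny stand-in for tbl_sale_items.csv, and chunksize=2 is chosen only so that the demo produces several chunks; both are assumptions for illustration.

```python
from io import StringIO
import pandas as pd

# Tiny stand-in for 'tbl_sale_items.csv' (assumption: same ';'-separated layout).
csv_text = """saleid;upc
155_02127453_20090616_135212_0021;02317639000000
155_02127453_20090616_135212_0021;00000000000888
155_01605733_20090616_135221_0016;00264850000000
155_01072401_20090616_135224_0010;02316877000000
155_01072401_20090616_135224_0010;05051969277205"""

# chunksize=2 is only for the demo; the question uses 1000000.
read_opts = dict(sep=';', dtype=str, chunksize=2)

# Pass 1: collect every distinct UPC, reading only the 'upc' column.
all_upcs = set()
for chunk in pd.read_csv(StringIO(csv_text), usecols=['upc'], **read_opts):
    all_upcs.update(chunk['upc'].unique())
all_upcs = sorted(all_upcs)

# Pass 2: crosstab each chunk, then force the full, identically ordered column set.
tables = []
for chunk in pd.read_csv(StringIO(csv_text), **read_opts):
    t = pd.crosstab(chunk.saleid, chunk.upc)
    t = t.reindex(columns=all_upcs, fill_value=0)
    tables.append(t)  # with the real file, write t to its own output file instead

# Demo-only check: a saleid can be split across two chunks, so rows with the
# same saleid must be summed when the per-chunk tables are combined.
combined = pd.concat(tables).groupby(level=0).sum()
print(combined)
```

With the real file, each reindexed table would be written out with to_csv inside the loop rather than kept in a list. Note that a sale whose rows straddle a chunk boundary appears in two output files, so any later merge has to add those rows together.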

Setup

from io import StringIO  # Python 3 location of StringIO
import pandas as pd

text = """    saleid                              upc
0   155_02127453_20090616_135212_0021   02317639000000
1   155_02127453_20090616_135212_0021   00000000000888
2   155_01605733_20090616_135221_0016   00264850000000
3   155_01072401_20090616_135224_0010   02316877000000
4   155_01072401_20090616_135224_0010   05051969277205"""

df = pd.read_csv(StringIO(text), sep=r'\s+', dtype=str)
df
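
As a quick check, the two options can be run side by side on the sample data. The sketch below rebuilds the five sample rows as ';'-separated text (an assumption made so the block is self-contained) and confirms that both options produce the same count table, with one row per saleid and one column per upc.

```python
from io import StringIO
import pandas as pd

# The five sample rows from the question, in ';'-separated form.
text = """saleid;upc
155_02127453_20090616_135212_0021;02317639000000
155_02127453_20090616_135212_0021;00000000000888
155_01605733_20090616_135221_0016;00264850000000
155_01072401_20090616_135224_0010;02316877000000
155_01072401_20090616_135224_0010;05051969277205"""

df = pd.read_csv(StringIO(text), sep=';', dtype=str)

# Option 1: count rows per (saleid, upc) pair, then spread upc into columns.
opt1 = df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)

# Option 2: crosstab builds the same frequency table directly.
opt2 = pd.crosstab(df.saleid, df.upc)

print(opt1)
print((opt1.values == opt2.values).all())
```

Both results contain counts rather than 0/1 flags, so a UPC that appears twice in the same sale would show 2 in its cell.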
