pandas pivot table of sales
I have a list like below:
saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205
It represents one customer (saleid) and the items he or she bought (the upc of each item).
What I want is to pivot this table to a form like below:
02317639000000 00000000000888 00264850000000 02316877000000
155_02127453_20090616_135212_0021 1 1 0 0
155_01605733_20090616_135221_0016 0 0 1 0
155_01072401_20090616_135224_0010 0 0 0 1
So, columns are unique UPCs and rows are unique SALEIDs.
I read it like this:
tbl = pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str})  # plain str; np.str was removed from NumPy
tbl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid object
upc object
dtypes: object(2)
memory usage: 283.4+ MB
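The string dtype matters here: without it, UPCs such as 00000000000888 would be parsed as integers and lose their leading zeros. A quick check on a made-up one-row file (the sample data here is mine, not from the original post):

```python
from io import StringIO
import pandas as pd

sample = "saleid;upc\ns1;00000000000888\n"

# Without dtype=str, the UPC is parsed as an integer and the zeros vanish.
as_num = pd.read_csv(StringIO(sample), sep=';')
as_str = pd.read_csv(StringIO(sample), sep=';', dtype=str)

print(as_num.upc[0])  # 888
print(as_str.upc[0])  # 00000000000888
```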
I have done some steps but not the correct ones!
tbl.pivot_table(columns=['upc'],aggfunc=pd.Series.nunique)
upc 00000000000000 00000000000109 00000000000116 00000000000123 00000000000130 00000000000147 00000000000154 00000000000161 00000000000178 00000000000185 ...
saleid 44950 287 26180 4881 1839 623 3347 7
EDIT: I'm using the following variation of the solution:
chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str}, chunksize=chunksize):
print(f)
t = pd.crosstab(chunk.saleid, chunk.upc)
t.head(3)
t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',header=True,sep=';',compression='bz2')
f = f+1
The original file is far too big for the converted result to fit in memory. The above solution has the problem that not every output file has all the columns, because I'm reading the original file in chunks.
Question 2: is there a way to force all chunks to have the same columns?
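One way to answer Question 2 (a sketch, not from the original answer; the helper name and two-pass design are mine): first scan just the upc column to collect the full set of values, then reindex each chunk's crosstab onto that shared column set.

```python
import pandas as pd

def crosstab_in_chunks(path, chunksize=1_000_000, sep=';'):
    """Crosstab a large saleid/upc CSV chunk by chunk, reindexing every
    chunk's result onto one shared, sorted set of upc columns."""
    # First pass: collect every distinct upc (only that column is loaded).
    all_upcs = set()
    for chunk in pd.read_csv(path, sep=sep, dtype=str,
                             usecols=['upc'], chunksize=chunksize):
        all_upcs.update(chunk['upc'].unique())
    columns = sorted(all_upcs)

    # Second pass: crosstab each chunk, then force the common column set.
    for chunk in pd.read_csv(path, sep=sep, dtype=str, chunksize=chunksize):
        t = pd.crosstab(chunk['saleid'], chunk['upc'])
        yield t.reindex(columns=columns, fill_value=0)
```

Each yielded frame can be written out with to_csv exactly as in the loop above; because all files then share identical columns, they can later be concatenated, and a saleid that straddles a chunk boundary can be merged with groupby(level=0).sum().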
Option 1
df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)
Option 2
pd.crosstab(df.saleid, df.upc)
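Both options build a dense table, which may not fit in memory with 18.5 million input rows. Not part of the original answer, but one alternative is a sparse matrix built from categorical codes (a sketch, assuming scipy is available; the helper name is mine):

```python
import numpy as np
import pandas as pd
from scipy import sparse

def sparse_indicator(df):
    """Return a sparse saleid-by-upc count matrix plus its row/column labels."""
    sale = df['saleid'].astype('category')
    upc = df['upc'].astype('category')
    mat = sparse.coo_matrix(
        (np.ones(len(df), dtype=np.int64),
         (sale.cat.codes, upc.cat.codes)),
        shape=(len(sale.cat.categories), len(upc.cat.categories)),
    ).tocsr()  # duplicate (saleid, upc) pairs are summed, like crosstab
    return mat, sale.cat.categories, upc.cat.categories
```

The CSR matrix stores only the nonzero cells, so memory scales with the number of (saleid, upc) pairs rather than rows times columns.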
Setup
from io import StringIO  # io, not the Python 2 StringIO module
import pandas as pd
text = """ saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205"""
df = pd.read_csv(StringIO(text), sep=r'\s+', dtype=str)  # delim_whitespace is deprecated
df
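For reference, the two options agree on this setup frame (rebuilt here so the snippet runs on its own):

```python
from io import StringIO
import pandas as pd

text = """ saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205"""
df = pd.read_csv(StringIO(text), sep=r'\s+', dtype=str)

out1 = df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)
out2 = pd.crosstab(df['saleid'], df['upc'])

# Both yield the same saleid-by-upc table of counts (here, 0/1 indicators),
# with rows and columns sorted lexicographically.
assert (out1.values == out2.values).all()
```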