pandas pivot table of sales
I have a list like below:
saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205
It represents one customer (saleid) and the items he or she bought (the upc of each item).
What I want is to pivot this table to a form like below:
02317639000000 00000000000888 00264850000000 02316877000000
155_02127453_20090616_135212_0021 1 1 0 0
155_01605733_20090616_135221_0016 0 0 1 0
155_01072401_20090616_135224_0010 0 0 0 1
So, columns are unique UPCs and rows are unique SALEIDs.
I read it like this:
tbl = pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str})  # plain str; np.str was removed from NumPy
tbl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid object
upc object
dtypes: object(2)
memory usage: 283.4+ MB
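The string dtype matters here: without it, UPCs such as 00000000000888 would be parsed as integers and lose their leading zeros. A quick check on a made-up one-row file (the sample data here is mine, not from the original post):

```python
from io import StringIO
import pandas as pd

sample = "saleid;upc\ns1;00000000000888\n"

# Without dtype=str, the UPC is parsed as an integer and the zeros vanish.
as_num = pd.read_csv(StringIO(sample), sep=';')
as_str = pd.read_csv(StringIO(sample), sep=';', dtype=str)

print(as_num.upc[0])  # 888
print(as_str.upc[0])  # 00000000000888
```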
I have done some steps but not the correct ones!
tbl.pivot_table(columns=['upc'],aggfunc=pd.Series.nunique)
upc 00000000000000 00000000000109 00000000000116 00000000000123 00000000000130 00000000000147 00000000000154 00000000000161 00000000000178 00000000000185 ...
saleid 44950 287 26180 4881 1839 623 3347 7
EDIT: I'm using the following variation of the solution:
chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': str, 'upc': str}, chunksize=chunksize):
print(f)
t = pd.crosstab(chunk.saleid, chunk.upc)
t.head(3)
t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',header=True,sep=';',compression='bz2')
f = f+1
The original file is far too big for the converted result to fit in memory. The above solution has the problem that not every output file has all the columns, because I'm reading the original file in chunks.
Question 2: is there a way to force all chunks to have the same columns?
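One way to answer Question 2 (a sketch, not from the original answer; the helper name and two-pass design are mine): first scan just the upc column to collect the full set of values, then reindex each chunk's crosstab onto that shared column set.

```python
import pandas as pd

def crosstab_in_chunks(path, chunksize=1_000_000, sep=';'):
    """Crosstab a large saleid/upc CSV chunk by chunk, reindexing every
    chunk's result onto one shared, sorted set of upc columns."""
    # First pass: collect every distinct upc (only that column is loaded).
    all_upcs = set()
    for chunk in pd.read_csv(path, sep=sep, dtype=str,
                             usecols=['upc'], chunksize=chunksize):
        all_upcs.update(chunk['upc'].unique())
    columns = sorted(all_upcs)

    # Second pass: crosstab each chunk, then force the common column set.
    for chunk in pd.read_csv(path, sep=sep, dtype=str, chunksize=chunksize):
        t = pd.crosstab(chunk['saleid'], chunk['upc'])
        yield t.reindex(columns=columns, fill_value=0)
```

Each yielded frame can be written out with to_csv exactly as in the loop above; because all files then share identical columns, they can later be concatenated, and a saleid that straddles a chunk boundary can be merged with groupby(level=0).sum().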
Option 1
df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)
Option 2
pd.crosstab(df.saleid, df.upc)
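Both options build a dense table, which may not fit in memory with 18.5 million input rows. Not part of the original answer, but one alternative is a sparse matrix built from categorical codes (a sketch, assuming scipy is available; the helper name is mine):

```python
import numpy as np
import pandas as pd
from scipy import sparse

def sparse_indicator(df):
    """Return a sparse saleid-by-upc count matrix plus its row/column labels."""
    sale = df['saleid'].astype('category')
    upc = df['upc'].astype('category')
    mat = sparse.coo_matrix(
        (np.ones(len(df), dtype=np.int64),
         (sale.cat.codes, upc.cat.codes)),
        shape=(len(sale.cat.categories), len(upc.cat.categories)),
    ).tocsr()  # duplicate (saleid, upc) pairs are summed, like crosstab
    return mat, sale.cat.categories, upc.cat.categories
```

The CSR matrix stores only the nonzero cells, so memory scales with the number of (saleid, upc) pairs rather than rows times columns.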
Setup
from io import StringIO  # io, not the Python 2 StringIO module
import pandas as pd
text = """ saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205"""
df = pd.read_csv(StringIO(text), sep=r'\s+', dtype=str)  # delim_whitespace is deprecated
df
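For reference, the two options agree on this setup frame (rebuilt here so the snippet runs on its own):

```python
from io import StringIO
import pandas as pd

text = """ saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205"""
df = pd.read_csv(StringIO(text), sep=r'\s+', dtype=str)

out1 = df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)
out2 = pd.crosstab(df['saleid'], df['upc'])

# Both yield the same saleid-by-upc table of counts (here, 0/1 indicators),
# with rows and columns sorted lexicographically.
assert (out1.values == out2.values).all()
```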