Pandas pivot or groupby for dynamically generated columns
Question
I have a DataFrame with sales information from a supermarket. Each row represents an item, with several characteristics as columns. The original DataFrame looks something like this:
In [1]: import pandas as pd
        my_data = [{'ticket_number': '001', 'item': 'tomato', 'ticket_price': '21'},
                   {'ticket_number': '001', 'item': 'candy', 'ticket_price': '21'},
                   {'ticket_number': '001', 'item': 'soup', 'ticket_price': '21'},
                   {'ticket_number': '002', 'item': 'soup', 'ticket_price': '12'},
                   {'ticket_number': '002', 'item': 'cola', 'ticket_price': '12'},
                   {'ticket_number': '003', 'item': 'beef', 'ticket_price': '56'},
                   {'ticket_number': '003', 'item': 'tomato', 'ticket_price': '56'},
                   {'ticket_number': '003', 'item': 'pork', 'ticket_price': '56'}]
        df = pd.DataFrame(my_data)
In [2]: df
Out[2]:
  ticket_number ticket_price    item
0           001           21  tomato
1           001           21   candy
2           001           21    soup
3           002           12    soup
4           002           12    cola
5           003           56    beef
6           003           56  tomato
7           003           56    pork
I need a DataFrame where each row represents a ticket, with all the items bought on that ticket and the ticket price as columns. For this example:
  ticket_number ticket_price   item1   item2 item3
0           001           21  tomato   candy  soup
1           002           12    soup    cola
2           003           56    beef  tomato  pork
I tried df.groupby('ticket_number').item.value_counts(), but that does not create new columns. I have never used pivot_table; maybe it would be useful here.
Any help would be much appreciated.
Thanks!
Recommended answer
One possible approach is to use groupby to collect each ticket's items into a list, and then expand those lists into columns:
In [24]: res = df.groupby(['ticket_number', 'ticket_price'])['item'].apply(list).apply(pd.Series)
In [25]: res
Out[25]:
                                 0       1     2
ticket_number ticket_price
001           21            tomato   candy  soup
002           12              soup    cola   NaN
003           56              beef  tomato  pork
Then, after cleaning up this result a bit (renaming the columns and resetting the index):
In [27]: res.columns = ['item' + str(i + 1) for i in res.columns]
In [29]: res.reset_index()
Out[29]:
  ticket_number ticket_price   item1   item2 item3
0           001           21  tomato   candy  soup
1           002           12    soup    cola   NaN
2           003           56    beef  tomato  pork
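As a side note, .apply(pd.Series) builds one Series per group, which can get slow on large frames. Here is a minimal sketch of a variant that builds the wide frame directly from the lists instead. The names lists and wide are just illustrative, and the sample data is re-created so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "ticket_number": ["001", "001", "001", "002", "002", "003", "003", "003"],
    "item": ["tomato", "candy", "soup", "soup", "cola", "beef", "tomato", "pork"],
    "ticket_price": ["21", "21", "21", "12", "12", "56", "56", "56"],
})

# Collect each ticket's items into a Python list (one list per group).
lists = df.groupby(["ticket_number", "ticket_price"])["item"].apply(list)

# Build the wide frame in one step; shorter lists are padded with missing values.
wide = pd.DataFrame(lists.tolist(), index=lists.index)

# Rename the positional columns 0, 1, 2 to item1, item2, item3.
wide.columns = [f"item{i + 1}" for i in wide.columns]
wide = wide.reset_index()
```

The number of item columns follows the longest ticket in the data, so nothing has to be hard-coded.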
Another possible approach is to create a new column that numbers the items within each group, using groupby.cumcount:
In [38]: df['item_number'] = df.groupby('ticket_number').cumcount()
In [39]: df
Out[39]:
     item ticket_number ticket_price  item_number
0  tomato           001           21            0
1   candy           001           21            1
2    soup           001           21            2
3    soup           002           12            0
4    cola           002           12            1
5    beef           003           56            0
6  tomato           003           56            1
7    pork           003           56            2
And then do some reshaping:
In [40]: df.set_index(['ticket_number', 'ticket_price', 'item_number']).unstack(-1)
Out[40]:
                              item
item_number                      0       1     2
ticket_number ticket_price
001           21            tomato   candy  soup
002           12              soup    cola   NaN
003           56              beef  tomato  pork
From here, with some cleaning of the column names, you can get the same result as above.
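That column cleanup could look roughly like the sketch below. After unstack the columns are a two-level MultiIndex such as ('item', 0), so each pair is flattened into a single name. The variable name res is just illustrative, and the sample data is re-created so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "ticket_number": ["001", "001", "001", "002", "002", "003", "003", "003"],
    "item": ["tomato", "candy", "soup", "soup", "cola", "beef", "tomato", "pork"],
    "ticket_price": ["21", "21", "21", "12", "12", "56", "56", "56"],
})

# Number the items within each ticket: 0, 1, 2, ...
df["item_number"] = df.groupby("ticket_number").cumcount()

# Move item_number from the row index into the columns.
res = df.set_index(["ticket_number", "ticket_price", "item_number"]).unstack(-1)

# Flatten the column pairs ('item', 0), ('item', 1), ... into 'item1', 'item2', ...
res.columns = [f"{name}{number + 1}" for name, number in res.columns]
res = res.reset_index()
```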
The reshaping step with set_index and unstack could also be done with pivot_table:
df.pivot_table(columns=['item_number'], index=['ticket_number', 'ticket_price'], values='item', aggfunc='first')
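Put together, the pivot_table route might be wrapped up as in the sketch below. The helper name tickets_to_wide is hypothetical (it is not part of pandas), and the sample data is re-created so the snippet runs on its own. aggfunc='first' is used because the values are strings, and each (ticket, item_number) cell holds exactly one item anyway:

```python
import pandas as pd


def tickets_to_wide(df):
    """Hypothetical helper: one row per ticket, items spread over item1..itemN."""
    numbered = df.copy()
    # Number the items within each ticket: 0, 1, 2, ...
    numbered["item_number"] = numbered.groupby("ticket_number").cumcount()
    wide = numbered.pivot_table(
        index=["ticket_number", "ticket_price"],
        columns="item_number",
        values="item",
        aggfunc="first",  # each cell has a single item, so 'first' just picks it
    )
    # Rename the positional columns 0, 1, 2 to item1, item2, item3.
    wide.columns = [f"item{i + 1}" for i in wide.columns]
    return wide.reset_index()


df = pd.DataFrame({
    "ticket_number": ["001", "001", "001", "002", "002", "003", "003", "003"],
    "item": ["tomato", "candy", "soup", "soup", "cola", "beef", "tomato", "pork"],
    "ticket_price": ["21", "21", "21", "12", "12", "56", "56", "56"],
})

result = tickets_to_wide(df)
```

Working on a copy keeps the helper column item_number out of the caller's DataFrame.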