Ensuring the Cartesian product of keys appears in a Pandas table
Problem description
I have a Pandas dataframe that has two key columns, and I want to ensure that the Cartesian product of those keys exists in the table (because I'll have to make a 2D plot containing all combinations). I'm having trouble coming up with a reasonably brief and idiomatic way to do this.
For example, I start with this table giving combinations of fruits and vegetables, and how they taste together:
   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce
I want to end up with this table in which all possible combinations occur:
    fruit      veg    combo
0   apple   carrot    tasty
1   apple  lettuce  UNKNOWN
2  banana   carrot    yucky
3  banana  lettuce    tasty
4   lemon   carrot  UNKNOWN
5   lemon  lettuce    yucky
Here's the best way I've found to do it:
import pandas as pd
# Initial data
df = pd.DataFrame(dict(fruit=['apple', 'banana', 'banana', 'lemon'],
                       veg=['carrot', 'carrot', 'lettuce', 'lettuce'],
                       combo=['tasty', 'yucky', 'tasty', 'yucky']))
# Solution starts here
veg = df.groupby('veg').size().reset_index()
fruit = df.groupby('fruit').size().reset_index()
fruit[0] = veg[0] = 0  # use this dummy column for the join to work!
cartesian = pd.merge(fruit, veg, how='outer', on=0)
del cartesian[0]
all_combos = pd.merge(cartesian, df, how='left')
# Fill only the combo column; assigning to the whole masked frame would also overwrite fruit and veg
all_combos.loc[pd.isnull(all_combos.combo), 'combo'] = 'UNKNOWN'
I imagine that there's got to be a simpler and less error-prone way to do this... any advice?
I'd especially appreciate it if someone could show me how to do this both with and without a multi-index containing the fruit and veg columns, because I am really stumped about how to do this with indexes. Based on my SQL experience, I'd think these are exactly the situations that indexes are intended for.
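For the variant without a multi-index, the dummy-column join above amounts to a cross join. The following is a minimal sketch, assuming pandas 1.2 or later, where merge accepts how='cross'. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(dict(fruit=['apple', 'banana', 'banana', 'lemon'],
                       veg=['carrot', 'carrot', 'lettuce', 'lettuce'],
                       combo=['tasty', 'yucky', 'tasty', 'yucky']))

# One single-column frame per key, holding that key's distinct values.
fruit = df[['fruit']].drop_duplicates()
veg = df[['veg']].drop_duplicates()

# how='cross' pairs every fruit row with every veg row (assumes pandas >= 1.2).
cartesian = fruit.merge(veg, how='cross')

# Left-join the known combos back on, then fill only the combo column.
all_combos = cartesian.merge(df, how='left', on=['fruit', 'veg'])
all_combos['combo'] = all_combos['combo'].fillna('UNKNOWN')
all_combos = all_combos.sort_values(['fruit', 'veg']).reset_index(drop=True)
print(all_combos)
```

Filling through the combo column alone leaves the fruit and veg keys untouched, which is the part of the dummy-column version that is easiest to get wrong.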
Sometime after this answer, I added cartesian_product to pandas, and soon after MultiIndex.from_product was added (following a suggestion in another question). This enables the following simplification, which is also more efficient (df1 is df indexed by fruit and veg, as built in the older answer below):
In [21]: p = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
In [22]: df1.reindex(p, fill_value='UNKNOWN')
Out[22]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky
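As a self-contained sketch of the same idea, the script below starts from the question's data and ends with the flat table the question asks for. The reset_index step and the name full_index are additions for illustration and are not part of the session above.

```python
import pandas as pd

df = pd.DataFrame(dict(fruit=['apple', 'banana', 'banana', 'lemon'],
                       veg=['carrot', 'carrot', 'lettuce', 'lettuce'],
                       combo=['tasty', 'yucky', 'tasty', 'yucky']))

# Move the two key columns into a MultiIndex.
df1 = df.set_index(['fruit', 'veg'])

# Every fruit/veg pair, built from the distinct values of each index level.
full_index = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

# Pairs missing from df1 get fill_value; reset_index turns the keys back into columns.
result = df1.reindex(full_index, fill_value='UNKNOWN').reset_index()
print(result)
```

One caveat: index.levels can still list values that no longer occur if df1 was filtered after set_index. In that case, calling df1.index.remove_unused_levels() first should restrict the product to the values actually present.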
The older answer follows:
If you use fruit and veg as the index, then you could use itertools.product* to create the MultiIndex to reindex by:
In [10]: import numpy as np
    ...: from itertools import product
In [11]: df
Out[11]:
   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce
The tricky part is to grab the right MultiIndex of all the possible fruit/veg:
In [12]: fruit_x_veg = list(product(np.unique(df['fruit']), np.unique(df['veg'])))
In [13]: fruit_x_veg = pd.MultiIndex.from_tuples(fruit_x_veg,
    ...:                                         names=['fruit', 'veg'])
Then you can just reindex by these:
In [14]: df1 = df.set_index(['fruit', 'veg'])
In [15]: df1
Out[15]:
                combo
fruit  veg
apple  carrot   tasty
banana carrot   yucky
       lettuce  tasty
lemon  lettuce  yucky
In [16]: df1.reindex(fruit_x_veg, fill_value='UNKNOWN')
Out[16]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky
* If itertools.product is not fast enough, consider using this numpy implementation.
Note: this implementation was extended in pandas.tools.util.cartesian_product, which now supports more dtypes (and is used under the hood in MultiIndex.from_product).
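The link to the numpy implementation mentioned in the footnote did not survive. The sketch below is not that implementation. It is one possible way to build the same pairs with np.repeat and np.tile, and cartesian_pairs is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def cartesian_pairs(a, b):
    """Hypothetical helper: all (a_i, b_j) pairs as two aligned 1-D arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    # np.repeat holds each value of a for len(b) slots;
    # np.tile cycles through b once per value of a.
    return np.repeat(a, len(b)), np.tile(b, len(a))

fruits = np.array(['apple', 'banana', 'lemon'])
vegs = np.array(['carrot', 'lettuce'])

f, v = cartesian_pairs(fruits, vegs)
index = pd.MultiIndex.from_arrays([f, v], names=['fruit', 'veg'])
print(index)
```

The resulting index can be passed to reindex in the same way as fruit_x_veg above. For everyday use, MultiIndex.from_product is likely the simpler choice.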