Ensuring the Cartesian product of keys appears in a Pandas table

Problem description

I have a Pandas dataframe that has two key columns, and I want to ensure that the Cartesian product of those keys exists in the table (because I'll have to make a 2D plot containing all combinations). I'm having trouble coming up with a reasonably brief and idiomatic way to do this.

For example, I start with this table giving combinations of fruits and vegetables, and how they taste together:

   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce

I want to end up with this table in which all possible combinations occur:

    fruit      veg    combo
0   apple   carrot    tasty
1   apple  lettuce  UNKNOWN
2  banana   carrot    yucky
3  banana  lettuce    tasty
4   lemon   carrot  UNKNOWN
5   lemon  lettuce    yucky

Here's the best way I've found to do it:

import pandas as pd

# Initial data
df=pd.DataFrame(dict(fruit=['apple','banana','banana','lemon'],
                     veg=['carrot','carrot','lettuce','lettuce'],
                     combo=['tasty','yucky','tasty','yucky']))

# Solution starts here
veg=df.groupby('veg').size().reset_index()
fruit=df.groupby('fruit').size().reset_index()
fruit[0] = veg[0] = 0    #use this dummy column for the join to work!
cartesian = pd.merge(fruit, veg, how='outer', on=0)
del cartesian[0]
all_combos = pd.merge(cartesian, df, how='left')
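# Fill only the combo column; assigning to all_combos[mask] would overwrite fruit and veg as well
all_combos['combo'] = all_combos['combo'].fillna('UNKNOWN')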

I imagine that there's got to be a simpler and less error-prone way to do this... any advice?

I'd especially appreciate it if someone could show me how to do this both with and without a multi-index containing the fruit and veg columns, because I am really stumped about how to do this with indexes. Based on my SQL experience, I'd think these are exactly the situations that indexes are intended for.

Solution

Some time after this answer was written, I added cartesian_product to pandas, and MultiIndex.from_product was added soon afterwards (as suggested in another question). With df1 = df.set_index(['fruit', 'veg']), as defined in the older answer below, this allows a simpler and more efficient solution:

In [21]: p = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

In [22]: df1.reindex(p, fill_value='UNKNOWN')
Out[22]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky
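
For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same from_product/reindex approach. The names full_index and all_combos are mine, not part of the original answer, and the final reset_index is only there to get back the flat table the question asks for.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "fruit": ["apple", "banana", "banana", "lemon"],
    "veg": ["carrot", "carrot", "lettuce", "lettuce"],
    "combo": ["tasty", "yucky", "tasty", "yucky"],
})

# Use the two key columns as a MultiIndex
df1 = df.set_index(["fruit", "veg"])

# Build every (fruit, veg) pair from the distinct values of each level
full_index = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

# Reindex onto the full set of pairs; missing pairs get the fill value
all_combos = df1.reindex(full_index, fill_value="UNKNOWN").reset_index()
print(all_combos)
```

Note that reindex requires the existing (fruit, veg) pairs to be unique, which holds for this data.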

The older answer follows:


If you use fruit and veg as the index, then you could use itertools.product* to create the MultiIndex to reindex by:

In [10]: from itertools import product
    ...: import numpy as np

In [11]: df
Out[11]:
   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce

The tricky part is building the MultiIndex of all possible fruit/veg pairs:

In [12]: fruit_x_veg = list(product(np.unique(df['fruit']), np.unique(df['veg'])))

In [13]: fruit_x_veg = pd.MultiIndex.from_tuples(fruit_x_veg,
                                                 names=['fruit', 'veg'])

Then you can just reindex by these:

In [14]: df1 = df.set_index(['fruit', 'veg'])

In [15]: df1
Out[15]:
                combo
fruit  veg
apple  carrot   tasty
banana carrot   yucky
       lettuce  tasty
lemon  lettuce  yucky

In [16]: df1.reindex(fruit_x_veg, fill_value='UNKNOWN')
Out[16]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky
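
The older approach can also be written as a standalone script. This is only a sketch: it uses sorted(Series.unique()) in place of np.unique to avoid the NumPy import, and the variable same is a hypothetical check I added to show that the hand-built index matches what MultiIndex.from_product produces for the same inputs.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "banana", "lemon"],
    "veg": ["carrot", "carrot", "lettuce", "lettuce"],
    "combo": ["tasty", "yucky", "tasty", "yucky"],
})

# Distinct key values, sorted so the result has a predictable order
fruits = sorted(df["fruit"].unique())
vegs = sorted(df["veg"].unique())

# All (fruit, veg) pairs as tuples, then as a MultiIndex
fruit_x_veg = pd.MultiIndex.from_tuples(
    list(product(fruits, vegs)), names=["fruit", "veg"]
)

# Sanity check (my addition): from_product yields the same index
same = fruit_x_veg.equals(
    pd.MultiIndex.from_product([fruits, vegs], names=["fruit", "veg"])
)

result = df.set_index(["fruit", "veg"]).reindex(fruit_x_veg, fill_value="UNKNOWN")
print(same)
print(result)
```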

* If itertools.product is not fast enough, consider using this numpy implementation

Note: this implementation was extended in pandas.tools.util.cartesian_product, which now supports more dtypes (and is used under the hood in MultiIndex.from_product).
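
The question also asks for a variant without a MultiIndex. The answer above does not cover that case, so the following is my own sketch rather than the answerer's method. It assumes pandas 1.2 or later, where merge accepts how='cross', which replaces the dummy-column join from the question. The names fruits, vegs, cartesian and all_combos are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "banana", "lemon"],
    "veg": ["carrot", "carrot", "lettuce", "lettuce"],
    "combo": ["tasty", "yucky", "tasty", "yucky"],
})

# Distinct values of each key, kept as one-column frames
fruits = df[["fruit"]].drop_duplicates()
vegs = df[["veg"]].drop_duplicates()

# Cross join: every fruit paired with every veg (assumes pandas >= 1.2)
cartesian = fruits.merge(vegs, how="cross")

# Left join brings in the known combos; unmatched pairs get NaN
all_combos = cartesian.merge(df, on=["fruit", "veg"], how="left")

# Fill only the combo column so the key columns stay intact
all_combos["combo"] = all_combos["combo"].fillna("UNKNOWN")
print(all_combos)
```

Unlike the reindex approach, this one does not require the key pairs in df to be unique, although duplicated pairs will then appear more than once in the output.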
