如何通过FeatureTools中的多个ID创建新变量? [英] How to create new variables by multiple ids in featuretools?

查看：53 发布时间：2021/5/13 19:48:14 python pandas group-by feature-engineering featuretools

本文介绍了如何通过FeatureTools中的多个ID创建新变量?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个数据集，该数据集的每个成员和每个事务都有一行，并且购买的商店可能来自"brand_id".我想使用featuretools进行输出，每个成员只有一行，每个品牌ID的总收入为收入".

I have a dataset that has one row per member and per transaction, and there are different stores the purchase could have came from 'brand_id'. I want to use featuretools to make output that would have one row per member, with an aggregate of 'revenue' by each brand id.

我想要什么:

import featuretools as ft
import pandas as pd

df = pd.DataFrame({'member_id': [1,1,1,1,2,2,3,4,4,4,4,5,5,5],
                   'transaction_id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14],
                   'brand_id': ['A','A','B','B','B','B','A','B','A','B','B','A','B','A'],
                   'revenue': [32,124,54,12,512,51,12,4,12,412,512,14,89,12]
                  })

df2 = df.groupby(
            ['member_id',
             'brand_id']
        ).agg({
            'transaction_id': 'count',
            'revenue' :['sum', 'mean']}
        ).reset_index()

df2.columns = ['member_id', 'brand_id', 'transactions', 'revenue_sum', 'revenue_mean']

df2 = df2.pivot(index='member_id', 
                columns='brand_id', 
                values=['transactions',
                      'revenue_sum',
                      'revenue_mean']
               ).fillna(0
               ).reset_index()

groups = ['A', 'B']
df2.columns = ['memberid'] + \
              [x + '_transactions_count' for x in groups] + \
              [x + '_revenue_sum' for x in groups] + \
              [x + '_revenue_mean' for x in groups]

输出内容如下:

这是我尝试使用功能工具的尝试，但是无论我尝试了什么，我都无法创建由'brand_id'的每个唯一值分解的新变量.

Here is my attempt using feature tools, but no matter what I have tried I cannot get it to create new variables broken out by each unique value of 'brand_id'.

es = ft.EntitySet(id = 'my_set')
es.entity_from_dataframe(entity_id='members',
                        index='member_id',
                        dataframe = pd.DataFrame({'member_id':[1,2,3,4,5]})
                        )

es.entity_from_dataframe(entity_id='trans',
                        index='transaction_id',
                        variable_types = {'brand_id': ft.variable_types.Id},
                        dataframe=df.copy()
                        )
# create the relationship
r_member_accrual = ft.Relationship(es['members']['member_id'],
                                   es['trans']['member_id'])
# add the relationship to the entity set
es = es.add_relationship(r_member_accrual)

fm, fl = ft.dfs(target_entity='members',
               entityset=es,
               agg_primitives=['sum','mean','count'],
               groupby_trans_primitives=["cum_sum"],
               primitive_options={
                   'cum_sum': {
                       'ignore_groupby_variables': {'trans':['member_id']}
                   }
               }
               )

这是我得到的结果，这不是我想要的结果.

Here is the result that I get, which is not what I am going for.

推荐答案

您可以将 interesting_values 与 where_primitives 一起使用，以汇总每个成员和品牌的收入.我将遍历该示例.首先，我们使用交易作为基础实体来创建实体集.

You can use interesting_values with where_primitives to aggregate revenue per member and brand. I'll walk through the example. First, we create the entity set by using transactions as the base entity.

es = ft.EntitySet()

es.entity_from_dataframe(
    entity_id='transactions',
    index='transaction_id',
    dataframe=df,
)

然后，我们将成员和品牌的实体标准化.这将自动创建关系.

Then, we normalize entities for members and brands. This will automatically create the relationships.

es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='brands',
    index='brand_id',
)

es.normalize_entity(
    base_entity_id='transactions',
    new_entity_id='members',
    index='member_id',
)

现在，我们在交易中将有趣的值添加到 brand_id 中.这使我们能够创建以品牌价值为条件的功能.

Now, we add the interesting values to brand_id in transactions. This allows us to create features conditioned on brand values.

es['transactions']['brand_id'].interesting_values = ['A', 'B']

最后，我们使用 where_primitives 运行DFS，以指定哪些聚合原语可以创建基于有趣值的特征.

Lastly, we run DFS using where_primitives to specify which aggregation primitives can create features conditioned on interesting values.

fm, fl = ft.dfs(
    target_entity='members',
    entityset=es,
    agg_primitives=['sum', 'mean', 'count'],
    where_primitives=['sum', 'mean', 'count'],
)

fm.filter(regex='WHERE').round(1).T

member_id                                          1      2     3      4     5
COUNT(transactions WHERE brand_id = A)           2.0    0.0   1.0    1.0   2.0
COUNT(transactions WHERE brand_id = B)           2.0    2.0   0.0    3.0   1.0
SUM(transactions.revenue WHERE brand_id = A)   156.0    0.0  12.0   12.0  26.0
SUM(transactions.revenue WHERE brand_id = B)    66.0  563.0   0.0  928.0  89.0
MEAN(transactions.revenue WHERE brand_id = A)   78.0    NaN  12.0   12.0  13.0
MEAN(transactions.revenue WHERE brand_id = B)   33.0  281.5   NaN  309.3  89.0

让我知道这是否有帮助.

Let me know if this helps.

这篇关于如何通过FeatureTools中的多个ID创建新变量?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何通过FeatureTools中的多个ID创建新变量? [英] How to create new variables by multiple ids in featuretools?

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

如何通过FeatureTools中的多个ID创建新变量? [英] How to create new variables by multiple ids in featuretools?

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭