How to apply Deep Feature Synthesis to a single table

Question

After processing, my data is one table with several columns that are features and one column which is a label. I would like to use featuretools.dfs to help me predict the label. Is it possible to do it directly, or do I need to split my single table into multiple?

Recommended answer

It is possible to run DFS on a single table. For example, if you have a pandas DataFrame df whose index column is 'index', you would write:

import featuretools as ft
es = ft.EntitySet('Transactions')

es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')

fm, features = ft.dfs(entityset=es, 
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])

The generated feature matrix will look like this:

In [1]: fm
Out[1]: 
             location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index                                                                  
1         main street          3              4           12         29
2         main street          4              5           12         30
3         main street          5              6           12         31
4      arlington ave.         18              0            1          1
5      arlington ave.          1              1            1          2

This will apply "transform" primitives to your data. To use aggregation primitives as well, you usually want to give ft.dfs more than one entity. You can read about the difference in our documentation.
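To make the transform side concrete, here is a minimal pandas-only sketch of what the 'weekday', 'month' and 'day' primitives compute on the example data. It is a hand-written stand-in, not Featuretools itself; the variable name transformed is made up for illustration, and the column labels simply mimic the feature names shown in the matrix above.

```python
import pandas as pd

# Same toy data as in the answer's example.
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street', 'main street', 'main street',
                 'arlington ave.', 'arlington ave.'],
    'pies sold': [3, 4, 5, 18, 1],
})
df['date'] = pd.date_range('2017-12-29', periods=5, freq='D')

# Hand-written equivalents of three transform primitives.
# Each new value is derived from the 'date' of the same row,
# so no second table or grouping key is needed.
transformed = df.set_index('index')
transformed['WEEKDAY(date)'] = transformed['date'].dt.weekday  # Monday == 0
transformed['MONTH(date)'] = transformed['date'].dt.month
transformed['DAY(date)'] = transformed['date'].dt.day

print(transformed.drop(columns='date'))
```

An aggregation such as a sum or a mean, by contrast, needs a group of rows to summarize, which is why a second entity to group by is usually required.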

A standard workflow is to normalize your single entity by an interesting categorical variable. If your df was the single table

| index | location       | pies sold |       date |
|-------+----------------+-----------+------------|
|     1 | main street    |         3 | 2017-12-29 |
|     2 | main street    |         4 | 2017-12-30 |
|     3 | main street    |         5 | 2017-12-31 |
|     4 | arlington ave. |        18 | 2018-01-01 |
|     5 | arlington ave. |         1 | 2018-01-02 |

you would probably be interested in normalizing by location:

es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')

Your new entity locations would have the table

| location       | first_log_time |
|----------------+----------------|
| main street    |     2017-12-29 |
| arlington ave. |     2018-01-01 |

which would make features like locations.SUM(log.pies sold) or locations.MEAN(log.pies sold), which sum or average all values by location. You can see these features created in the example below:

In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...: 

Out[1]: 
   index        location  pies sold       date
0      1     main street          3 2017-12-29
1      2     main street          4 2017-12-30
2      3     main street          5 2017-12-31
3      4  arlington ave.         18 2018-01-01
4      5  arlington ave.          1 2018-01-02

In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index',
   ...:                          time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations',
   ...:                     index='location')
   ...: 
Out[2]: 
Entityset: Transactions
  Entities:
    log [Rows: 5, Columns: 4]
    locations [Rows: 2, Columns: 2]
  Relationships:
    log.location -> locations.location

In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...: 
Out[3]: 
             location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
index                                                                                                                                  
1         main street          3         29                             29                            4.0                            12
2         main street          4         30                             29                            4.0                            12
3         main street          5         31                             29                            4.0                            12
4      arlington ave.         18          1                              1                            9.5                            19
5      arlington ave.          1          2                              1                            9.5                            19
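As a cross-check on the matrix above, the location-level features can be approximated with plain pandas. This is only a sketch of the idea behind normalizing and aggregating, not how Featuretools implements it; the names locations and fm_sketch are made up for illustration, and the column labels are chosen to mirror the feature names in Out[3].

```python
import pandas as pd

# Same toy data as in the session above.
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street', 'main street', 'main street',
                 'arlington ave.', 'arlington ave.'],
    'pies sold': [3, 4, 5, 18, 1],
})
df['date'] = pd.date_range('2017-12-29', periods=5, freq='D')

# Rough analogue of normalizing by location: one row per location,
# holding the earliest log date plus the sum and mean of 'pies sold'.
locations = df.groupby('location', sort=False).agg(**{
    'first_log_time': ('date', 'min'),
    'SUM(log.pies sold)': ('pies sold', 'sum'),
    'MEAN(log.pies sold)': ('pies sold', 'mean'),
})

# Join the per-location values back onto every log row.
fm_sketch = df.set_index('index').join(
    locations.add_prefix('locations.'), on='location')

# A transform applied to a column of the parent table.
fm_sketch['locations.DAY(first_log_time)'] = (
    fm_sketch['locations.first_log_time'].dt.day)

print(fm_sketch.drop(columns=['date', 'locations.first_log_time']))
```

Every log row for a given location receives the same aggregated values, which matches the repeated 12 / 4.0 and 19 / 9.5 entries in the feature matrix.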
