Should we exclude the target variable from DFS in Featuretools?


Question

When passing dataframes as entities in an entityset and running DFS on it, are we supposed to exclude the target variable from DFS? My model had a roc_auc score of 0.76 after I manually tried traditional feature selection methods, so I used Featuretools to see whether it would improve the score. I ran DFS on an entityset that included the target variable as well. Surprisingly, the roc_auc score went up to 0.996 and accuracy to 0.9997, so I am doubtful of these scores: since I passed the target variable into Deep Feature Synthesis, information related to the target might have leaked into training. Am I assuming correctly?

Solution

Deep Feature Synthesis and Featuretools do allow you to keep your target in your entity set (in order to create new features using historical values of it), but you need to set up the "time index" and use "cutoff times" to do this without label leakage.

You use the time index to specify the column that holds the value for when data in each row became known. This column is specified using the time_index keyword argument when creating the entity using entity_from_dataframe.

Then, you use cutoff times when running ft.dfs() or ft.calculate_feature_matrix() to specify the last point in time you should use data when calculating each row of your feature matrix. Feature calculation will only use data up to and including the cutoff time. So, if this cutoff time is before the time index value of your target, you won’t have label leakage.
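The cutoff rule described above can be illustrated without Featuretools at all. The following is a minimal plain-Python sketch of the idea, not Featuretools' implementation; the toy rows, the `visible_rows` helper, and the column names are all hypothetical.

```python
from datetime import datetime

# Hypothetical toy rows: "time" plays the role of the time index,
# i.e. the moment each row (including its target) became known.
rows = [
    {"id": 1, "time": datetime(2020, 1, 1), "target": 0},
    {"id": 2, "time": datetime(2020, 1, 5), "target": 1},
    {"id": 3, "time": datetime(2020, 1, 9), "target": 1},
]

def visible_rows(rows, cutoff):
    """Return only the rows known at or before the cutoff (inclusive)."""
    return [r for r in rows if r["time"] <= cutoff]

# Build features for row 3 with a cutoff earlier than its own time index.
cutoff = datetime(2020, 1, 8)
history = visible_rows(rows, cutoff)
visible_ids = [r["id"] for r in history]

# Row 3's own target is not in `history`, so this feature cannot leak it.
mean_past_target = sum(r["target"] for r in history) / len(history)

# Because the cutoff is inclusive, a cutoff equal to row 3's time index
# would expose row 3's target, which is exactly the leakage case.
leaky_ids = [r["id"] for r in visible_rows(rows, datetime(2020, 1, 9))]
```

The last line shows why the cutoff has to be strictly earlier than the target's time index: with an inclusive comparison, equal timestamps make the label visible.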

You can read about those concepts in detail in the documentation on Handling Time.

If you don’t want to deal with the target at all, you have two options:

  1. You can use pandas to drop it out of your dataframe entirely before making it an entity. If it’s not in the entityset, it can’t be used to create features.

  2. You can set the drop_contains keyword argument in ft.dfs to ['target']. This stops any feature from being created which includes the string 'target'.
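To make the second option concrete, here is a small plain-Python sketch of substring-based filtering on feature names. It only mimics the behavior described above and is not Featuretools' code; the `drop_features_containing` helper and the candidate feature names are invented for illustration.

```python
def drop_features_containing(feature_names, drop_contains):
    """Keep only the names that contain none of the given substrings."""
    return [
        name
        for name in feature_names
        if not any(fragment in name for fragment in drop_contains)
    ]

# Hypothetical candidate features, including ones derived from the target.
candidates = [
    "MEAN(transactions.amount)",
    "target",
    "MEAN(sessions.target)",
    "COUNT(sessions)",
    "targeted_ads_seen",
]

kept = drop_features_containing(candidates, ["target"])
```

Note that substring matching is broad: in this sketch the unrelated `targeted_ads_seen` is dropped too, so a distinctive target column name helps avoid removing features by accident.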

No matter which of the above options you choose, it is still possible to pass a target column directly through DFS. If you add the target to your cutoff times dataframe, it is passed through to the resulting feature matrix. That can be useful because it ensures the target column remains aligned with the other features. You can see an example of passing the label through in the documentation.

Advanced Solution using Secondary Time Index

Sometimes a single time index isn’t enough to represent datasets where information in a row became known at two different times. This commonly occurs when the target is a column. To handle this situation, we need to use a "secondary time index".

Here is an example from a Kaggle kernel on predicting when a patient will miss an appointment with a doctor where a secondary time index is used. The dataset has a scheduled_time, when the appointment is scheduled, and an appointment_day, which is when the appointment actually happens. We want to tell Featuretools that some information like the patient’s age is known when they schedule the appointment, but other information like whether or not a patient actually showed up isn't known until the day of the appointment.

To do this, we create an appointments entity with a secondary time index as follows:

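import featuretools as ft

# `data` is the pandas DataFrame of appointments, loaded beforehand.
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='appointment_id',
                              time_index='scheduled_time',
                              # these columns only become known on appointment_day
                              secondary_time_index={'appointment_day': ['no_show', 'sms_received']})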

This says that most columns can be used as of the time index scheduled_time, but that the variables no_show and sms_received can’t be used until the time given by the secondary time index, appointment_day.
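The effect of a secondary time index can be sketched in plain Python as a per-column visibility rule. This is only an illustration of the semantics described above, with a hypothetical `as_of` helper and a made-up row; values that are not yet known are represented as `None` here.

```python
from datetime import datetime

# Columns that only become known at the secondary time index.
SECONDARY_COLUMNS = {"no_show", "sms_received"}

def as_of(row, cutoff):
    """Return the row as it would be known at `cutoff`.

    Returns None if the appointment has not been scheduled yet. Columns tied
    to the secondary time index are masked until appointment_day has passed.
    """
    if row["scheduled_time"] > cutoff:
        return None
    late_columns_known = row["appointment_day"] <= cutoff
    return {
        key: (value if key not in SECONDARY_COLUMNS or late_columns_known else None)
        for key, value in row.items()
    }

# Hypothetical appointment row.
row = {
    "appointment_id": 1,
    "age": 30,
    "scheduled_time": datetime(2016, 4, 1),
    "appointment_day": datetime(2016, 4, 10),
    "no_show": True,
    "sms_received": False,
}

before_scheduling = as_of(row, datetime(2016, 3, 1))
between = as_of(row, datetime(2016, 4, 5))
after_appointment = as_of(row, datetime(2016, 4, 10))
```

Between scheduling and the appointment, `age` is visible while `no_show` and `sms_received` are masked; from the appointment day onward the whole row is available.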

We then make predictions at the scheduled_time by setting our cutoff times to be:

cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']]

By passing that dataframe into DFS, the no_show column will be passed through untouched, while historical values of no_show can still be used to create features. An example would be something like ages.PERCENT_TRUE(appointments.no_show), or "the percentage of people of each age that have not shown up in the past".
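As a rough plain-Python illustration of that feature (not how Featuretools computes it), the sketch below builds one row per appointment using scheduled_time as the cutoff, computes the share of past no-shows for the same age from outcomes already known at that cutoff, and carries the label through unchanged. The data and the `percent_true_no_show` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical appointments; no_show becomes known on appointment_day.
appointments = [
    {"appointment_id": 1, "age": 30, "scheduled_time": datetime(2016, 4, 1),
     "appointment_day": datetime(2016, 4, 3), "no_show": True},
    {"appointment_id": 2, "age": 30, "scheduled_time": datetime(2016, 4, 2),
     "appointment_day": datetime(2016, 4, 4), "no_show": False},
    {"appointment_id": 3, "age": 30, "scheduled_time": datetime(2016, 4, 10),
     "appointment_day": datetime(2016, 4, 12), "no_show": True},
    {"appointment_id": 4, "age": 45, "scheduled_time": datetime(2016, 4, 11),
     "appointment_day": datetime(2016, 4, 20), "no_show": False},
]

def percent_true_no_show(age, cutoff):
    """Share of no-shows for this age among outcomes known at the cutoff."""
    known = [
        a["no_show"]
        for a in appointments
        if a["age"] == age and a["appointment_day"] <= cutoff
    ]
    return sum(known) / len(known) if known else None

# One row per appointment: the feature uses only past outcomes, while the
# label column is copied through untouched and stays aligned with its row.
feature_matrix = [
    {
        "appointment_id": a["appointment_id"],
        "ages.PERCENT_TRUE(appointments.no_show)":
            percent_true_no_show(a["age"], a["scheduled_time"]),
        "no_show": a["no_show"],
    }
    for a in appointments
]
```

For appointment 3, the feature is computed from appointments 1 and 2 only, because appointment 3's own outcome is not known until after its cutoff.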
