How to use Featuretools to create features from multiple columns in a single dataframe by column values?

Problem description

I'm trying to predict results of football matches based on earlier results. I'm running Python 3.6 on Windows and using Featuretools 0.4.1.

Let's say I have the following dataframe representing history of results.

[Image: original dataframe]

Using the dataframe above I want to create the following dataframe which will be fed to machine learning algorithm as X. Note that goal averages for home and away teams need to be calculated by team despite their past match venues. Is there a way to create such a dataframe using Featuretools?

[Image: resulting dataframe]

Excel file used to simulate the transformation can be found here.

Recommended answer

This is a tricky feature, but a great usage of a custom primitive in Featuretools.

The first step is to load the CSV of matches into a Featuretools EntitySet:

import pandas as pd
import featuretools as ft

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)

Then we define a custom transform primitive that calculates the average goals scored in the last n games. It has two parameters: n controls the number of past games, and which_team controls whether to calculate for the home or away team. Information on defining custom primitives is in our documentation here and here (https://docs.featuretools.com/guides/advanced_custom_primitives.html).

from featuretools.variable_types import Numeric, Categorical
from featuretools.primitives import make_trans_primitive

def avg_goals_previous_n_games(home_team, away_team, home_goals, away_goals, which_team=None, n=1):
    # make dataframe so it's easier to work with
    df = pd.DataFrame({
        "home_team": home_team,
        "away_team": away_team,
        "home_goals": home_goals,
        "away_goals": away_goals
        })

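    result = []
    # enumerate so that i is always the row position, even if the inputs carry a non-default index
    for i, (_, current_game) in enumerate(df.iterrows()):
        # get the right team for this game
        team = current_game[which_team]

        # find all previous games that have been played
        prev_games = df.iloc[:i]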

        # only get games the team participated in
        participated = prev_games[(prev_games["home_team"] == team) | (prev_games["away_team"] == team)]
        if participated.shape[0] < n:
            result.append(None)
            continue

        # get last n games
        last_n = participated.tail(n)

        # calculate goals scored by the team in each game
        goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
        goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

        # calculate mean across all home and away games
        mean = (goal_as_home + goal_as_away).mean()

        result.append(mean)

    return result

# custom function so the name of the feature prints out correctly
def make_name(self):
    return "%s_goal_last_%d" % (self.kwargs['which_team'], self.kwargs['n'])


AvgGoalPreviousNGames = make_trans_primitive(function=avg_goals_previous_n_games,
                                          input_types=[Categorical, Categorical, Numeric, Numeric],
                                          return_type=Numeric,
                                          cls_attributes={"generate_name": make_name, "uses_full_entity":True})

Now we can define features using this primitive. In this case, we will have to do it manually.

input_vars = [es["matches"]["home_team"], es["matches"]["away_team"], es["matches"]["home_goals"], es["matches"]["away_goals"]]
home_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=1)
home_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=3)
home_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=5)
away_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=1)
away_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=3)
away_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=5)

features = [home_team_last1, home_team_last3, home_team_last5,
            away_team_last1, away_team_last3, away_team_last5]

Finally, we can calculate the feature matrix

fm = ft.calculate_feature_matrix(entityset=es, features=features)

This returns

          home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5
match_id                                                                                                                                          
1                           NaN                    NaN                    NaN                    NaN                    NaN                    NaN
2                           2.0                    NaN                    NaN                    0.0                    NaN                    NaN
3                           1.0                    NaN                    NaN                    0.0                    NaN                    NaN
4                           3.0               1.000000                    NaN                    0.0               1.000000                    NaN
5                           1.0               1.333333                    NaN                    1.0               0.666667                    NaN
6                           2.0               2.000000                    1.2                    0.0               0.333333                    0.8
7                           1.0               0.666667                    0.6                    2.0               1.666667                    1.6
8                           2.0               1.000000                    0.8                    2.0               2.000000                    2.0
9                           0.0               1.000000                    0.8                    1.0               1.666667                    1.6
10                          3.0               2.000000                    2.0                    1.0               1.000000                    0.8
11                          3.0               2.333333                    2.2                    1.0               0.666667                    1.0
12                          2.0               2.666667                    2.2                    2.0               1.333333                    1.2

Finally, we can also use these manually defined features as an input to the automated feature engineering using Deep Feature Synthesis, which is explained here. By passing the manually defined features in as seed_features, ft.dfs will automatically stack on top of them.

fm, feature_defs = ft.dfs(entityset=es, 
                          target_entity="matches",
                          seed_features=features, 
                          agg_primitives=[], 
                          trans_primitives=["day", "month", "year", "weekday", "percentile"])

feature_defs

[<Feature: home_team>,
 <Feature: away_team>,
 <Feature: home_goals>,
 <Feature: away_goals>,
 <Feature: label>,
 <Feature: home_team_goal_last_1>,
 <Feature: home_team_goal_last_3>,
 <Feature: home_team_goal_last_5>,
 <Feature: away_team_goal_last_1>,
 <Feature: away_team_goal_last_3>,
 <Feature: away_team_goal_last_5>,
 <Feature: DAY(match_date)>,
 <Feature: MONTH(match_date)>,
 <Feature: YEAR(match_date)>,
 <Feature: WEEKDAY(match_date)>,
 <Feature: PERCENTILE(home_goals)>,
 <Feature: PERCENTILE(away_goals)>,
 <Feature: PERCENTILE(home_team_goal_last_1)>,
 <Feature: PERCENTILE(home_team_goal_last_3)>,
 <Feature: PERCENTILE(home_team_goal_last_5)>,
 <Feature: PERCENTILE(away_team_goal_last_1)>,
 <Feature: PERCENTILE(away_team_goal_last_3)>,
 <Feature: PERCENTILE(away_team_goal_last_5)>]

The feature matrix is

         home_team away_team  home_goals  away_goals label  home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5  DAY(match_date)  MONTH(match_date)  YEAR(match_date)  WEEKDAY(match_date)  PERCENTILE(home_goals)  PERCENTILE(away_goals)  PERCENTILE(home_team_goal_last_1)  PERCENTILE(home_team_goal_last_3)  PERCENTILE(home_team_goal_last_5)  PERCENTILE(away_team_goal_last_1)  PERCENTILE(away_team_goal_last_3)  PERCENTILE(away_team_goal_last_5)
match_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
1          Arsenal   Chelsea           2           0     1                    NaN                    NaN                    NaN                    NaN                    NaN                    NaN                1                  1              2014                    2                0.666667                0.166667                                NaN                                NaN                                NaN                                NaN                                NaN                                NaN
2          Arsenal   Chelsea           1           0     1                    2.0                    NaN                    NaN                    0.0                    NaN                    NaN                2                  1              2014                    3                0.333333                0.166667                           0.590909                                NaN                                NaN                           0.227273                                NaN                                NaN
3          Arsenal   Chelsea           0           3     2                    1.0                    NaN                    NaN                    0.0                    NaN                    NaN                3                  1              2014                    4                0.125000                0.958333                           0.272727                                NaN                                NaN                           0.227273                                NaN                                NaN
4          Chelsea   Arsenal           1           1     X                    3.0               1.000000                    NaN                    0.0               1.000000                    NaN                4                  1              2014                    5                0.333333                0.500000                           0.909091                           0.333333                                NaN                           0.227273                           0.500000                                NaN
5          Chelsea   Arsenal           2           0     1                    1.0               1.333333                    NaN                    1.0               0.666667                    NaN                5                  1              2014                    6                0.666667                0.166667                           0.272727                           0.555556                                NaN                           0.590909                           0.277778                                NaN
6          Chelsea   Arsenal           2           1     1                    2.0               2.000000                    1.2                    0.0               0.333333                    0.8                6                  1              2014                    0                0.666667                0.500000                           0.590909                           0.722222                           0.571429                           0.227273                           0.111111                           0.214286
7          Arsenal   Chelsea           2           2     X                    1.0               0.666667                    0.6                    2.0               1.666667                    1.6                7                  1              2014                    1                0.666667                0.791667                           0.272727                           0.111111                           0.142857                           0.909091                           0.833333                           0.785714
8          Arsenal   Chelsea           0           1     2                    2.0               1.000000                    0.8                    2.0               2.000000                    2.0                8                  1              2014                    2                0.125000                0.500000                           0.590909                           0.333333                           0.357143                           0.909091                           1.000000                           1.000000
9          Arsenal   Chelsea           1           3     2                    0.0               1.000000                    0.8                    1.0               1.666667                    1.6                9                  1              2014                    3                0.333333                0.958333                           0.090909                           0.333333                           0.357143                           0.590909                           0.833333                           0.785714
10         Chelsea   Arsenal           3           1     1                    3.0               2.000000                    2.0                    1.0               1.000000                    0.8               10                  1              2014                    4                0.916667                0.500000                           0.909091                           0.722222                           0.714286                           0.590909                           0.500000                           0.214286
11         Chelsea   Arsenal           2           2     X                    3.0               2.333333                    2.2                    1.0               0.666667                    1.0               11                  1              2014                    5                0.666667                0.791667                           0.909091                           0.888889                           0.928571                           0.590909                           0.277778                           0.428571
12         Chelsea   Arsenal           4           1     1                    2.0               2.666667                    2.2                    2.0               1.333333                    1.2               12                  1              2014                    6                1.000000                0.500000                           0.590909                           1.000000                           0.928571                           0.909091                           0.666667                           0.571429
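The PERCENTILE(...) columns appear to be percentile ranks within each column. This is an assumption based on the printed values, not on the Featuretools source. A minimal sketch that reproduces the PERCENTILE(home_goals) column with pandas alone:

```python
import pandas as pd

# home_goals copied from the feature matrix above, indexed by match_id.
home_goals = pd.Series([2, 1, 0, 1, 2, 2, 2, 0, 1, 3, 2, 4],
                       index=pd.RangeIndex(1, 13, name="match_id"))

# rank(pct=True) averages the ranks of tied values and divides by the
# number of non-missing values.
pct = home_goals.rank(pct=True)
print(pct.round(6))
```

For match 1 this gives 0.666667, and for match 3 it gives 0.125, which agrees with the table.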