How to build a hybrid model of RF (Random Forest) and PSO (Particle Swarm Optimizer) to find the optimal discount of products?

Problem description

    I need to find the optimal discount for each product (e.g. A, B, C) so that I can maximize total sales. I have an existing Random Forest model for each product that maps discount and season to sales. How do I combine these models and feed them to an optimizer to find the optimal discount per product?

    Reasons for model selection:

    1. RF: it captures the relationship between the predictors and the response (sales_uplift_norm) better than linear models do.
    2. PSO: it is suggested in many white papers (available on ResearchGate/IEEE), and Python packages for it are available.

    Input data: sample data used to build the models at product level (columns: product, season, discount_percentage, sales_uplift_norm).

    Idea/steps I followed:

    1. Build an RF model per product

        # pre-processed data
        products_pre_processed_data = {key:pre_process_data(df, key) for key, df in df_basepack_dict.items()}
        # rf models
        products_rf_model = {key:rf_fit(df) for key, df in products_pre_processed_data.items()}
    

    2. Pass the models to the optimizer
      • Objective function: maximize sales_uplift_norm (the response variable of the RF models)
      • Constraints:
        • total spend: spend of A + B + C <= 20, where spend = total_units_sold_of_product * discount_percentage * mrp_of_product
        • lower bounds for products (A, B, C): [0.0, 0.0, 0.0] # discount percentage lower bounds
        • upper bounds for products (A, B, C): [0.3, 0.4, 0.4] # discount percentage upper bounds

    Pseudo/sample code, as I am unable to find a way to pass the product models into the optimizer:

    from pyswarm import pso
    def obj(x):
        model1 = products_rf_model.get('A')
        model2 = products_rf_model.get('B')
        model3 = products_rf_model.get('C')
        return -(model1 + model2 + model3) # -ve sign as to maximize
    
    def con(x):
        x1 = x[0]
        x2 = x[1]
        x3 = x[2]
        return np.sum(units_A*x*mrp_A + units_B*x*mrp_B + units_C* x *spend_C)-20 # spend budget
    
    lb = [0.0, 0.0, 0.0]
    ub = [0.3, 0.4, 0.4]
    
    xopt, fopt = pso(obj, lb, ub, f_ieqcons=con)
    

    Dear SO experts, I request your guidance (I have been struggling to find any for a couple of weeks) on how to use the PSO optimizer (or any other optimizer, if PSO is not the right choice) with RF.

    Functions used for the models:

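    # Imports assumed by the functions below (they were not shown in the original post)
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import scale

    def pre_process_data(df, product):
        data = df.copy().reset_index()
    #     print(data)
        bp = product
        print("----------product: {}----------".format(bp))
        # Pre-processing steps
        print("pre process df.shape {}".format(df.shape))
        # 1. Response var transformation
        response = data.sales_uplift_norm # already transformed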
    
            #2. predictor numeric var transformation 
        numeric_vars = ['discount_percentage'] # may include mrp, depth
        df_numeric = data[numeric_vars]
        df_norm = df_numeric.apply(lambda x: scale(x), axis = 0) # center and scale
    
            #3. char fields dummification
        #select category fields
        cat_cols = data.select_dtypes('category').columns
        #select string fields
        str_to_cat_cols = data.drop(['product'], axis = 1).select_dtypes('object').astype('category').columns
        # combine all categorical fields
        all_cat_cols = [*cat_cols,*str_to_cat_cols]
    #     print(all_cat_cols)
    
        #convert cat to dummies
        df_dummies = pd.get_dummies(data[all_cat_cols])
    
            #4. combine num and char df together
        df_combined = pd.concat([df_dummies.reset_index(drop=True), df_norm.reset_index(drop=True)], axis=1)
        
        df_combined['sales_uplift_norm'] = response
        df_processed = df_combined.copy()
        print("post process df.shape {}".format(df_processed.shape))
    #     print("model fields: {}".format(df_processed.columns))
        return(df_processed)
    
    
    def rf_fit(df, random_state = 12):
        
        train_features = df.drop('sales_uplift_norm', axis = 1)
        train_labels = df['sales_uplift_norm']
        
        # Random Forest Regressor
        rf = RandomForestRegressor(n_estimators = 500,
                                   random_state = random_state,
                                   bootstrap = True,
                                   oob_score=True)
        # RF model
        rf_fit = rf.fit(train_features, train_labels)
    
        return(rf_fit)
    
    

    EDIT: updated the dataset to a simplified version.

    Solution

    You can find a complete solution below.

    The fundamental differences from your approach are the following:

    1. Since the Random Forest models take the season feature as input, optimal discounts must be computed for every season.
    2. According to the pyswarm documentation, the con function must return a value that satisfies con(x) >= 0.0. The correct constraint is therefore 20 - sum(...), not the other way around. In addition, the units and mrp variables were not given; I assumed a value of 1 for each, so you might want to change those values.
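The two points above can be isolated in a small, dependency-free sketch: one objective per season, built as a closure over the models, and a constraint written so that a non-negative return value means "feasible". The `toy_models` below are hypothetical stand-ins (plain functions of season and discount), not the fitted Random Forest pipelines, and the helper names `make_objective` and `make_constraint` are introduced here for illustration only.

```python
def make_objective(models, season):
    """Return obj(x) for one season; x[i] is the discount of the i-th product."""
    products = list(models)

    def obj(x):
        total = sum(models[p](season, x[i]) for i, p in enumerate(products))
        return -total  # minimizing -total maximizes the total predicted uplift

    return obj


def make_constraint(budget, units, mrp):
    """Return con(x), which is >= 0 exactly when total spend <= budget."""
    def con(x):
        spend = sum(u * m * d for u, m, d in zip(units, mrp, x))
        return budget - spend

    return con


# Hypothetical stand-in models: callables (season, discount) -> predicted uplift.
toy_models = {
    "A": lambda season, d: 1.0 * d,
    "B": lambda season, d: 2.0 * d,
    "C": lambda season, d: 0.5 * d,
}

obj = make_objective(toy_models, "summer")
con = make_constraint(budget=20, units=[1, 1, 1], mrp=[1, 1, 1])

ub = [0.3, 0.4, 0.4]
print(obj(ub))  # negative of the summed toy uplift at the upper bounds
print(con(ub))  # positive: even the maximum discounts stay within budget
```

Note that with the placeholder values units = mrp = 1, the largest possible spend is 0.3 + 0.4 + 0.4 = 1.1, far below the budget of 20, so the constraint never becomes active until realistic units and mrp values are supplied.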

    Additional modifications to your original code include:

    1. sklearn preprocessing and pipeline wrappers, which simplify the preprocessing steps.
    2. The optimal parameters are stored in an output .xlsx file.
    3. The maxiter parameter of the PSO has been set to 5 to speed up debugging; you might want to set it to another value (default = 100).
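To get a feel for why a small maxiter speeds things up, here is a rough estimate of how many objective evaluations one PSO run needs. It assumes pyswarm's default swarmsize of 100, one evaluation per particle for the initial swarm plus one per particle per iteration, and no early stop on the step or function tolerance, so treat the numbers as an approximate upper bound rather than a measurement. The helper `objective_calls` is hypothetical.

```python
def objective_calls(swarmsize=100, maxiter=100):
    # Initial swarm evaluation, then one evaluation per particle per iteration.
    return swarmsize * (1 + maxiter)


n_products = 3  # each objective call runs one predict() per product

for maxiter in (5, 100):
    calls = objective_calls(maxiter=maxiter)
    print("maxiter={}: {} objective calls, {} predict calls per season".format(
        maxiter, calls, calls * n_products))
```

Under these assumptions, every objective call runs three 500-tree forests on a one-row DataFrame, so the run time per season grows roughly linearly with maxiter.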

    The code is therefore:

    import pandas as pd 
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor 
    from sklearn.base import clone
    
    # ====================== RF TRAINING ======================
    # Preprocessing
    def build_sample(season, discount_percentage):
        return pd.DataFrame({
            'season': [season],
            'discount_percentage': [discount_percentage]
        })
    
    columns_to_encode = ["season"]
    columns_to_scale = ["discount_percentage"]
    encoder = OneHotEncoder()
    scaler = StandardScaler()
    preproc = ColumnTransformer(
        transformers=[
            ("encoder", Pipeline([("OneHotEncoder", encoder)]), columns_to_encode),
            ("scaler", Pipeline([("StandardScaler", scaler)]), columns_to_scale)
        ]
    )
    
    # Model
    myRFClassifier = RandomForestRegressor(
        n_estimators = 500,
        random_state = 12,
        bootstrap = True,
        oob_score = True)
    
    pipeline_list = [
        ('preproc', preproc),
        ('clf', myRFClassifier)
    ]
    
    pipe = Pipeline(pipeline_list)
    
    # Dataset
    df_tot = pd.read_excel("so_data.xlsx")
    df_dict = {
        product: df_tot[df_tot['product'] == product].drop(columns=['product']) for product in pd.unique(df_tot['product'])
    }
    
    # Fit
    print("Training ...")
    pipe_dict = {
        product: clone(pipe) for product in df_dict.keys()
    }
    
    for product, df in df_dict.items():
        X = df.drop(columns=["sales_uplift_norm"])
        y = df["sales_uplift_norm"]
        pipe_dict[product].fit(X,y)
    
    # ====================== OPTIMIZATION ====================== 
    from pyswarm import pso
    # Parameter of PSO
    maxiter = 5
    
    n_product = len(pipe_dict.keys())
    
    # Constraints
    budget = 20
    units  = [1, 1, 1]
    mrp    = [1, 1, 1]
    
    lb = [0.0, 0.0, 0.0]
    ub = [0.3, 0.4, 0.4]
    
    # Must always remain >= 0
    def con(x):
        s = 0
        for i in range(n_product):
            s += units[i] * mrp[i] * x[i]
    
        return budget - s
    
    print("Optimization ...")
    
    # Save optimal discounts for every product and every season
    df_opti = pd.DataFrame(data=None, columns=df_tot.columns)
    for season in pd.unique(df_tot['season']):
    
        # Objective function to minimize
        def obj(x):
            s = 0
            for i, product in enumerate(pipe_dict.keys()):
                # predict() returns a length-1 array; take the scalar so obj returns a float
                s += pipe_dict[product].predict(build_sample(season, x[i]))[0]
            
            return -s
    
        # PSO
        xopt, fopt = pso(obj, lb, ub, f_ieqcons=con, maxiter=maxiter)
        print("Season: {}\t xopt: {}".format(season, xopt))
    
        # Store result
        df_opti = pd.concat([
            df_opti,
            pd.DataFrame({
                'product': list(pipe_dict.keys()),
                'season': [season] * n_product,
                'discount_percentage': xopt,
                'sales_uplift_norm': [
                    pipe_dict[product].predict(build_sample(season, xopt[i]))[0] for i, product in enumerate(pipe_dict.keys())
                ]
            })
        ])
    
    # Save result
    df_opti = df_opti.reset_index(drop=True)
    df_opti.to_excel("so_result.xlsx")
    print("Summary")
    print(df_opti)
    

    It gives:

    Training ...
    Optimization ...
    Stopping search: maximum iterations reached --> 5
    Season: summer   xopt: [0.1941521  0.11233673 0.36548761]
    Stopping search: maximum iterations reached --> 5
    Season: winter   xopt: [0.18670604 0.37829516 0.21857777]
    Stopping search: maximum iterations reached --> 5
    Season: monsoon  xopt: [0.14898102 0.39847885 0.18889792]
    Summary
      product   season  discount_percentage  sales_uplift_norm
    0       A   summer             0.194152           0.175973
    1       B   summer             0.112337           0.229735
    2       C   summer             0.365488           0.374510
    3       A   winter             0.186706          -0.028205
    4       B   winter             0.378295           0.266675
    5       C   winter             0.218578           0.146012
    6       A  monsoon             0.148981           0.199073
    7       B  monsoon             0.398479           0.307632
    8       C  monsoon             0.188898           0.210134
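As a quick sanity check, the reported discounts can be verified against the bounds and the budget, and the predicted uplifts summed per season. The sketch below only re-uses the numbers printed above, together with the placeholder units and mrp values of 1; the `check` helper is hypothetical.

```python
lb = [0.0, 0.0, 0.0]
ub = [0.3, 0.4, 0.4]
budget, units, mrp = 20, [1, 1, 1], [1, 1, 1]

# (discounts, predicted uplifts) per season for products A, B, C,
# copied from the summary table above.
results = {
    "summer":  ([0.194152, 0.112337, 0.365488], [0.175973, 0.229735, 0.374510]),
    "winter":  ([0.186706, 0.378295, 0.218578], [-0.028205, 0.266675, 0.146012]),
    "monsoon": ([0.148981, 0.398479, 0.188898], [0.199073, 0.307632, 0.210134]),
}


def check(discounts):
    """True when every discount is within its bounds and total spend fits the budget."""
    in_bounds = all(l <= d <= u for l, d, u in zip(lb, discounts, ub))
    spend = sum(u * m * d for u, m, d in zip(units, mrp, discounts))
    return in_bounds and spend <= budget


summary = {s: (check(d), round(sum(up), 6)) for s, (d, up) in results.items()}
for season, (ok, total) in summary.items():
    print(season, "feasible:", ok, "total predicted uplift:", total)
```

Product A's predicted uplift in winter is negative at the reported discount. With maxiter set to 5 the swarm has probably not converged, so these numbers are better read as a smoke test than as final recommendations.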
    
