How to build a hybrid model of RF (Random Forest) and PSO (Particle Swarm Optimizer) to find the optimal discount of products?
Problem description
I need to find the optimal discount for each product (e.g. A, B, C) so that I can maximize total sales. I have an existing Random Forest model for each product that maps discount and season to sales. How do I combine these models and feed them to an optimizer to find the optimal discount per product?
Reason for model selection:
- RF: it captures the relation between the predictors and the response (sales_uplift_norm) better than linear models do.
- PSO: it is suggested in many white papers (available on ResearchGate/IEEE), and Python packages for it are available.
Input data: sample data used to build the model at product level.
Idea/steps I followed:
- Build an RF model per product

      # pre-processed data
      products_pre_processed_data = {key: pre_process_data(df, key) for key, df in df_basepack_dict.items()}
      # RF models
      products_rf_model = {key: rf_fit(df) for key, df in products_pre_processed_data.items()}

- Pass the models to the optimizer
- Objective function: maximize sales_uplift_norm (the response variable of the RF model)
- Constraints:
  - total spend (spend of A + B + C <= 20), where spend = total_units_sold_of_product * discount_percentage * mrp_of_product
  - lower bounds for products (A, B, C): [0.0, 0.0, 0.0] # discount percentage lower bounds
  - upper bounds for products (A, B, C): [0.3, 0.4, 0.4] # discount percentage upper bounds
Pseudo/sample code, as I am unable to find a way to pass the product models into the optimizer:

    from pyswarm import pso

    def obj(x):
        model1 = products_rf_model.get('A')
        model2 = products_rf_model.get('B')
        model3 = products_rf_model.get('C')
        return -(model1 + model2 + model3)  # -ve sign so as to maximize

    def con(x):
        x1 = x[0]
        x2 = x[1]
        x3 = x[2]
        return np.sum(units_A*x*mrp_A + units_B*x*mrp_B + units_C*x*spend_C) - 20  # spend budget

    lb = [0.0, 0.0, 0.0]
    ub = [0.3, 0.4, 0.4]
    xopt, fopt = pso(obj, lb, ub, f_ieqcons=con)

Dear SO experts, I request your guidance (I have been struggling to find any for a couple of weeks) on how to use the PSO optimizer (or any other optimizer, if I am not following the right one) with RF.
Adding the functions used for the model:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import scale  # assumed source of scale()

    def pre_process_data(df, product):
        data = df.copy().reset_index()
        # print(data)
        bp = product
        print("----------product: {}----------".format(bp))
        # Pre-processing steps
        print("pre process df.shape {}".format(df.shape))
        # 1. Response var transformation
        response = data.sales_uplift_norm  # already transformed
        # 2. Predictor numeric var transformation
        numeric_vars = ['discount_percentage']  # may include mrp, depth
        df_numeric = data[numeric_vars]
        df_norm = df_numeric.apply(lambda x: scale(x), axis=0)  # center and scale
        # 3. Char fields dummification
        # select category fields
        cat_cols = data.select_dtypes('category').columns
        # select string fields
        str_to_cat_cols = data.drop(['product'], axis=1).select_dtypes('object').astype('category').columns
        # combine all categorical fields
        all_cat_cols = [*cat_cols, *str_to_cat_cols]
        # print(all_cat_cols)
        # convert cat to dummies
        df_dummies = pd.get_dummies(data[all_cat_cols])
        # 4. Combine num and char df together
        df_combined = pd.concat([df_dummies.reset_index(drop=True), df_norm.reset_index(drop=True)], axis=1)
        df_combined['sales_uplift_norm'] = response
        df_processed = df_combined.copy()
        print("post process df.shape {}".format(df_processed.shape))
        # print("model fields: {}".format(df_processed.columns))
        return df_processed

    def rf_fit(df, random_state=12):
        train_features = df.drop('sales_uplift_norm', axis=1)
        train_labels = df['sales_uplift_norm']
        # Random Forest Regressor
        rf = RandomForestRegressor(n_estimators=500, random_state=random_state, bootstrap=True, oob_score=True)
        # RF model
        rf_fit = rf.fit(train_features, train_labels)
        return rf_fit

EDIT: updated dataset to simplified version.
Solution
You can find a complete solution below!
The fundamental differences from your approach are the following:
- Since the Random Forest model takes the `season` feature as input, optimal discounts must be computed for every season.
- According to the pyswarm documentation, the `con` function must yield an output that satisfies `con(x) >= 0.0`. The correct constraint is therefore `20 - sum(...)` and not the other way around. In addition, the `units` and `mrp` variables were not given; I just assumed a value of 1, so you might want to change those values.
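The sign convention is easy to check on a single point. The sketch below is pure Python and does not call pyswarm; the helper names `spend` and `con_reversed` are made up for illustration, and the units and MRP values of 1 are the same placeholder assumption as in the answer.

```python
def spend(x, units, mrp):
    """Total discount spend: sum of units * mrp * discount over products."""
    return sum(u * m * d for u, m, d in zip(units, mrp, x))


def con(x, units=(1, 1, 1), mrp=(1, 1, 1), budget=20):
    """Constraint in the form described above: feasible when the value is >= 0."""
    return budget - spend(x, units, mrp)


def con_reversed(x, units=(1, 1, 1), mrp=(1, 1, 1), budget=20):
    """The form used in the question (spend - budget): negative for a feasible point."""
    return spend(x, units, mrp) - budget


x = [0.3, 0.4, 0.4]            # upper bounds from the question
print(con(x) >= 0)             # within budget, so this form accepts the point
print(con_reversed(x) >= 0)    # the reversed form would reject the same point
```

With the reversed form, every point that respects the budget would be treated as infeasible, which is why the sign has to be flipped.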
Additional modifications to your original code include:
- Preprocessing and pipeline wrappers from `sklearn`, in order to simplify the preprocessing steps.
- Optimal parameters are stored in an output `.xlsx` file.
- The `maxiter` parameter of the PSO has been set to `5` to speed up debugging; you might want to set it to another value (default = `100`).
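To make concrete what `maxiter` controls, here is a toy particle swarm loop in pure Python. It is a sketch and not pyswarm's implementation: the name `pso_sketch`, the inertia and attraction coefficients (all 0.5), the swarm size, and the quadratic test objective are assumptions chosen for illustration.

```python
import random


def pso_sketch(obj, lb, ub, con=None, swarmsize=30, maxiter=100, seed=0):
    """Toy particle swarm minimizer (illustrative only, not pyswarm's code)."""
    rng = random.Random(seed)
    dim = len(lb)
    # A point is feasible when the constraint value is >= 0 (or there is no constraint).
    feasible = (lambda x: True) if con is None else (lambda x: con(x) >= 0)

    pos = [[rng.uniform(lb[d], ub[d]) for d in range(dim)] for _ in range(swarmsize)]
    vel = [[0.0] * dim for _ in range(swarmsize)]
    best_pos = [p[:] for p in pos]
    best_val = [obj(p) if feasible(p) else float("inf") for p in pos]
    g = min(range(swarmsize), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]

    for _ in range(maxiter):  # each pass moves every particle once
        for i in range(swarmsize):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.5 * vel[i][d]
                             + 0.5 * r1 * (best_pos[i][d] - pos[i][d])
                             + 0.5 * r2 * (g_pos[d] - pos[i][d]))
                # Clip to the bounds so discounts stay inside [lb, ub].
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lb[d]), ub[d])
            val = obj(pos[i])
            if feasible(pos[i]) and val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


# Hypothetical smooth objective with its minimum at (0.2, 0.2, 0.2).
obj = lambda x: sum((xi - 0.2) ** 2 for xi in x)
lb, ub = [0.0, 0.0, 0.0], [0.3, 0.4, 0.4]
x5, f5 = pso_sketch(obj, lb, ub, maxiter=5)
x100, f100 = pso_sketch(obj, lb, ub, maxiter=100)
print(f5, f100)
```

Because both runs share the same seed, the 100-iteration run is a continuation of the 5-iteration one, so its best value can only be equal or lower. A small `maxiter` is convenient for debugging but may stop before the swarm has settled.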
The code is therefore:

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.base import clone

    # ====================== RF TRAINING ======================
    # Preprocessing
    def build_sample(season, discount_percentage):
        return pd.DataFrame({
            'season': [season],
            'discount_percentage': [discount_percentage]
        })

    columns_to_encode = ["season"]
    columns_to_scale = ["discount_percentage"]
    encoder = OneHotEncoder()
    scaler = StandardScaler()
    preproc = ColumnTransformer(
        transformers=[
            ("encoder", Pipeline([("OneHotEncoder", encoder)]), columns_to_encode),
            ("scaler", Pipeline([("StandardScaler", scaler)]), columns_to_scale)
        ]
    )

    # Model
    myRFClassifier = RandomForestRegressor(
        n_estimators=500,
        random_state=12,
        bootstrap=True,
        oob_score=True)

    pipeline_list = [
        ('preproc', preproc),
        ('clf', myRFClassifier)
    ]

    pipe = Pipeline(pipeline_list)

    # Dataset
    df_tot = pd.read_excel("so_data.xlsx")
    df_dict = {
        product: df_tot[df_tot['product'] == product].drop(columns=['product'])
        for product in pd.unique(df_tot['product'])
    }

    # Fit one cloned pipeline per product
    print("Training ...")
    pipe_dict = {
        product: clone(pipe) for product in df_dict.keys()
    }

    for product, df in df_dict.items():
        X = df.drop(columns=["sales_uplift_norm"])
        y = df["sales_uplift_norm"]
        pipe_dict[product].fit(X, y)

    # ====================== OPTIMIZATION ======================
    from pyswarm import pso

    # Parameter of PSO
    maxiter = 5

    n_product = len(pipe_dict.keys())

    # Constraints
    budget = 20
    units = [1, 1, 1]
    mrp = [1, 1, 1]

    lb = [0.0, 0.0, 0.0]
    ub = [0.3, 0.4, 0.4]

    # Must always remain >= 0
    def con(x):
        s = 0
        for i in range(n_product):
            s += units[i] * mrp[i] * x[i]
        return budget - s

    print("Optimization ...")

    # Save optimal discounts for every product and every season
    df_opti = pd.DataFrame(data=None, columns=df_tot.columns)
    for season in pd.unique(df_tot['season']):

        # Objective function to minimize
        def obj(x):
            s = 0
            for i, product in enumerate(pipe_dict.keys()):
                # predict returns a length-1 array; take the scalar
                s += pipe_dict[product].predict(build_sample(season, x[i]))[0]
            return -s

        # PSO
        xopt, fopt = pso(obj, lb, ub, f_ieqcons=con, maxiter=maxiter)
        print("Season: {}\t xopt: {}".format(season, xopt))

        # Store result
        df_opti = pd.concat([
            df_opti,
            pd.DataFrame({
                'product': list(pipe_dict.keys()),
                'season': [season] * n_product,
                'discount_percentage': xopt,
                'sales_uplift_norm': [
                    pipe_dict[product].predict(build_sample(season, xopt[i]))[0]
                    for i, product in enumerate(pipe_dict.keys())
                ]
            })
        ])

    # Save result
    df_opti = df_opti.reset_index().drop(columns=['index'])
    df_opti.to_excel("so_result.xlsx")
    print("Summary")
    print(df_opti)

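The objective in the loop above is redefined for each season. An alternative way to express the same idea is a small factory that returns one objective per season. The sketch below uses hypothetical stand-in models (plain callables with made-up coefficients) instead of the fitted pipelines, so that it runs without any data file; with the real pipelines, the call inside `obj` would be `pipe_dict[p].predict(build_sample(season, x[i]))[0]`.

```python
def make_objective(models, season):
    """Return obj(x) = -(sum of predicted uplifts) for one fixed season."""
    products = list(models)

    def obj(x):
        total = sum(models[p](season, x[i]) for i, p in enumerate(products))
        return -total  # negated because the optimizer minimizes

    return obj


# Hypothetical stand-ins for the fitted models: (season, discount) -> uplift.
season_bonus = {"summer": 0.10, "winter": 0.00}
stub_models = {
    "A": lambda season, d: 1.0 * d + season_bonus[season],
    "B": lambda season, d: 0.5 * d + season_bonus[season],
    "C": lambda season, d: 2.0 * d + season_bonus[season],
}

# One objective per season, each ready to be handed to an optimizer.
objectives = {s: make_objective(stub_models, s) for s in season_bonus}
print(objectives["summer"]([0.3, 0.4, 0.4]))
print(objectives["winter"]([0.3, 0.4, 0.4]))
```

Binding the season as a function argument keeps each objective independent of the loop variable, which matters if the objectives are stored and called later.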
It gives:

    Training ...
    Optimization ...
    Stopping search: maximum iterations reached --> 5
    Season: summer   xopt: [0.1941521  0.11233673 0.36548761]
    Stopping search: maximum iterations reached --> 5
    Season: winter   xopt: [0.18670604 0.37829516 0.21857777]
    Stopping search: maximum iterations reached --> 5
    Season: monsoon  xopt: [0.14898102 0.39847885 0.18889792]
    Summary
      product   season  discount_percentage  sales_uplift_norm
    0       A   summer             0.194152           0.175973
    1       B   summer             0.112337           0.229735
    2       C   summer             0.365488           0.374510
    3       A   winter             0.186706          -0.028205
    4       B   winter             0.378295           0.266675
    5       C   winter             0.218578           0.146012
    6       A  monsoon             0.148981           0.199073
    7       B  monsoon             0.398479           0.307632
    8       C  monsoon             0.188898           0.210134

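As a sanity check, the reported discounts can be tested against the bounds and the budget. The sketch below copies the `xopt` values from the output above and reuses the placeholder `units` and `mrp` of 1; the helper name `spend` is made up for illustration.

```python
budget = 20
units = [1, 1, 1]   # placeholder values, as assumed in the answer
mrp = [1, 1, 1]     # placeholder values, as assumed in the answer
lb = [0.0, 0.0, 0.0]
ub = [0.3, 0.4, 0.4]

# xopt values copied from the printed output.
reported = {
    "summer": [0.1941521, 0.11233673, 0.36548761],
    "winter": [0.18670604, 0.37829516, 0.21857777],
    "monsoon": [0.14898102, 0.39847885, 0.18889792],
}


def spend(x):
    """Total spend for a vector of discounts."""
    return sum(u * m * d for u, m, d in zip(units, mrp, x))


for season, x in reported.items():
    in_bounds = all(l <= d <= u for l, d, u in zip(lb, x, ub))
    print(season, round(spend(x), 4), in_bounds, spend(x) <= budget)

# Largest spend the bounds allow with these placeholder values.
max_spend = spend(ub)
print(max_spend, max_spend <= budget)
```

With `units` and `mrp` set to 1, the largest possible spend is the sum of the upper bounds, which is far below the budget of 20, so the budget constraint never becomes active in this run. It will only influence the result once realistic unit and MRP values are supplied.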