How to label special cases in RandomForestRegressor in sklearn in python


Question

I have a set of numerical features (f1, f2, f3, f4, f5) for each user in my dataset.

       f1   f2   f3  f4   f5
user1  0.1  1.1  0   1.7  1
user2  1.1  0.3  1   1.3  3
user3  0.8  0.3  0   1.1  2
user4  1.5  1.2  1   0.8  3
user5  1.6  1.3  3   0.3  0

My target output is a prioritised user list, as shown in the example below.

       f1   f2   f3  f4   f5  target_priority
user1  0.1  1.1  0   1.7  1         2
user2  1.1  0.3  1   1.3  3         1
user3  0.8  0.3  0   1.1  2         5
user4  1.5  1.2  1   0.8  3         3
user5  1.6  1.3  3   0.3  0         4

I want to use these features in a way that reflects the priority of the user. Currently, I am using sklearn's `RandomForestRegressor` to perform this task.

However, I recently got my real dataset, and it has some users with no priority label. That is because such users are not important to our company (they are more like general users).

Example (how the real dataset looks):

       f1   f2   f3  f4   f5  target_priority
user1  0.1  1.1  0   1.7  1         2
user2  1.1  0.3  1   1.3  3         2
user3  0.8  0.3  0   1.1  2        N/A
user4  1.5  1.2  1   0.8  3        N/A
user5  1.6  1.3  3   0.3  0         1

In such special cases (users that do not have a priority label), is it good to give them a special symbol, or a priority level that is much lower than the existing priorities (e.g., a priority of 100000000000000000)? How are such special cases handled in `RandomForestRegressor`?

I am happy to provide more details if needed.

Answer

If 80-90% of the users don't need a priority, you should first build a classifier that decides whether a priority needs to be assigned at all. Since this would be a skewed class, I would recommend a decision tree or anomaly detection as the classifier; the data points that require a priority will be the anomalies. You can use sklearn for both.
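As a hedged illustration of the anomaly-detection route mentioned above, here is a minimal sketch using sklearn's `IsolationForest` on synthetic data (the feature values and the 95/5 split are made up for the example, not taken from the question's dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 95 "general" users clustered near 0, 5 unusual users far away
general = rng.normal(0, 1, size=(95, 5))
special = rng.normal(6, 1, size=(5, 5))
X = np.vstack([general, special])

# contamination = expected fraction of anomalies
# (here, the rare users who need a priority)
iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)  # -1 = anomaly, 1 = normal

print((labels == -1).sum())  # roughly 5 users flagged as anomalies
```

The flagged rows would then move on to the priority-assignment step; the rest are treated as general users.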

After deciding which objects have to be assigned a priority, I would look into the distribution of the training data with respect to priorities. You said that priorities range from 1-100, so if you have at least 5,000 data points and each priority level has at least 35 examples, I would suggest a multi-class classifier (an SVC with an RBF kernel is preferred) and a confusion matrix for checking its accuracy. If that doesn't work, you will have to use a regressor on the data and then round the answer.
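A minimal sketch of that suggestion, using a synthetic dataset (the 4 classes, sample counts, and `C`/`gamma` values below are placeholders, not tuned for the real problem):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the labelled users: 5 features, 4 priority levels
X, y = make_classification(n_samples=500, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multi-class SVC with an RBF kernel; C and gamma need tuning in practice
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_train, y_train)

# Rows are true priorities, columns are predicted priorities
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm.shape)  # (4, 4)
```

The diagonal of `cm` counts correct predictions per priority level, which makes class-specific weaknesses easy to spot.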

What I basically mean is: if the data is big enough and there is an even distribution among the target labels, go for multi-class classification; if the data is not big enough, go for a regressor. If you want code for any part of it, let me know.

Edit: code

OK, let's take it from the top. Firstly, the N.A. values in your target are stored either as `np.nan`, as a symbol like `?`, or as literal text like `N.A.`. In the latter cases your target label will be of type `object`; to check, use `df[['Target']].dtypes`. If it says `int` or `float`, you can skip the first step, but if it says `object`, we need to fix that first.

df.loc[df['Target'] == 'N.A.', 'Target'] = np.nan  # np = numpy; 'N.A.' can be any placeholder your dataset uses for missing values
df[['Target']] = df[['Target']].astype(float)

Now let's move to part two, where you need to build the target for your classifier. To do that, use:

df2 = pd.DataFrame()
df2['Bool'] = df['Target'].notna()  # note: `!= np.nan` is always True, so use notna() to detect missing values
df1 = pd.concat([df, df2], axis=1)
df1.head()  # sanity check

This will add `True` to your dataframe wherever a priority was assigned; this column will be the target for your classifier. Notice that we use `df1` and not `df`. Now drop `Target` from `df1`, as it is not important for the first part: `df1.drop(['Target'], axis=1, inplace=True)`

Now I am going to use random forest classification here, since anomaly detection should be avoided until the classes are skewed up to about 98%, but you can look into it.

Moving on, to build the random forest classifier:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2)  # note: max_depth is a hyperparameter and you will need to tune it
clf.fit(df1.drop(['Bool'], axis=1), df1['Bool'])

To drop the rows where the output is `False`:

df1 = df1[df1['Bool'] == True]

Then just use `clf.predict()` on the new data, drop the rows where the output comes out as `False`, and run a regressor on the remaining data. I am assuming you can do the regressor part, as that is now completely straightforward. Let me know if you face any further issues.

