KNN: give more weight to specific features in the distance

Question

I'm using the Kobe Bryant dataset and want to predict shot_made_flag with a KNN regressor.

I've used game_date to extract year and month features:

import re

# convert season to its starting year
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.findall(r'(\d+)-', x)[0]))

# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.findall(r'(\d{4})', x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.findall(r'-(\d+)-', x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
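As an alternative to the regular expressions above, the year and month can also be read from a parsed datetime column. This is a minimal sketch, assuming game_date holds ISO-style strings such as '2000-10-31'; the three-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
df = pd.DataFrame({"game_date": ["2000-10-31", "2008-06-17", "2016-04-13"]})

dates = pd.to_datetime(df["game_date"])   # parse strings into datetimes
df["year"] = dates.dt.year                # calendar year as an integer
df["month"] = dates.dt.month              # month number, 1 to 12
df = df.drop(columns=["game_date"])       # the raw string is no longer needed
```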

I want the season, year, and month features to carry more weight in the distance function, so that events closer in date to the current event become closer neighbors while still keeping reasonable distances to other data points. For example, I don't want an event from the same day to be the closest neighbor purely because of the date features; the other features, such as shot_range, should still be taken into account.
To add the weight, I tried the metric argument with a custom distance function, but the function only receives NumPy arrays without the pandas column information, so I'm not sure how to implement this.

Using larger weights for the date features, I search for the optimal k with 10-fold cross-validation, for k in [1, 100]:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
                'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features

not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)

neighbors = list(range(1, 101))  # k = 1..100 inclusive
cv_scores = []

weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
 X.columns.get_loc("year"),
 X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum()  #Normalize weights

def my_distance(x, y):
    dist = ((x-y)**2)
    return np.dot(dist, weight)

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))

# optimal k: ROC AUC is a score, so the best k has the highest value
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()

This runs really slowly. Any idea how to make it faster? The idea behind the weighted features is to find neighbors closer in date to the data point, to avoid data leakage, and to use cross-validation to find the optimal k.

Recommended answer

First, prepare a 1D NumPy weight array that specifies a weight for each feature. You could do something like:

M = X.shape[1]                    # M is the number of features
weight = np.ones((M,))
weight[[1, 7, 10]] = 2            # increase the weight of the features at (0-based) indexes 1, 7 and 10
weight = weight / weight.sum()    # normalize weights

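You can use the column index of your feature DataFrame (for example X.columns.get_loc("season")) to find the indexes of the season, year, and month features and replace the second line above. Take the indexes from the frame you actually pass to KNN: kobe_data_encoded still contains shot_made_flag, so its column positions may differ.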
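One way to build the weight array from column names rather than hard-coded positions is sketched below. The column list and the weight of 2 are assumptions for illustration; in practice the index would come from the feature frame passed to KNN.

```python
import numpy as np
import pandas as pd

# Hypothetical feature columns, in the order they are passed to KNN.
columns = pd.Index(["loc_x", "loc_y", "season", "shot_distance", "year", "month"])

date_cols = ["season", "year", "month"]
date_idx = columns.get_indexer(date_cols)   # positions of the date columns (-1 if missing)
assert (date_idx >= 0).all(), "a date column is missing"

weight = np.ones(len(columns))
weight[date_idx] = 2.0            # boost the date features
weight = weight / weight.sum()    # normalize so the weights sum to 1
```

Looking positions up by name keeps the weights aligned with the features even if columns are added or reordered later.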

Now define a distance function. According to the scikit-learn documentation, a callable metric must take two 1D NumPy arrays and return a scalar.

def my_dist(x, y):
    # weight is the module-level 1D array, same shape as x and y
    dist = (x - y) ** 2                    # squared differences, 1D array
    return np.sqrt(np.dot(dist, weight))   # scalar; the square root keeps this a true metric, which tree-based search expects

Then initialize KNeighborsRegressor as:

knn = KNeighborsRegressor(metric=my_dist)
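A weighted sum of squared differences equals the squared Euclidean distance between the feature vectors after each column has been multiplied by the square root of its weight. The sketch below checks that identity on made-up data; X_demo, w, and weighted_sq_dist are illustrative names, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((5, 4))              # made-up data: 5 samples, 4 features
w = np.array([1.0, 2.0, 1.0, 2.0])
w = w / w.sum()                          # normalized feature weights

def weighted_sq_dist(x, y):
    # Weighted sum of squared differences between two 1D arrays.
    return np.dot((x - y) ** 2, w)

X_scaled = X_demo * np.sqrt(w)           # scale each column by sqrt(weight)

a, b = 0, 3
d_custom = weighted_sq_dist(X_demo[a], X_demo[b])
d_scaled = np.sum((X_scaled[a] - X_scaled[b]) ** 2)   # plain squared Euclidean
```

If that holds, fitting a KNN model on X_scaled with the default Euclidean metric should select the same neighbors as the custom metric on the unscaled data (ties aside), while letting scikit-learn use its built-in compiled distance code instead of a Python callback.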

To make this efficient, you can precompute the distance matrix and reuse it in KNN. The matrix is computed once instead of once per value of k and per fold, which should give a significant speedup by reducing the number of calls to my_dist, since a non-vectorized custom Python distance function is quite slow. So now:

X_arr = X.to_numpy()                 # plain array: X[i] on a DataFrame would select a column, not a row
n = len(X_arr)
dist = np.zeros((n, n))              # N x N distance matrix
for i in range(n):
    for j in range(i + 1, n):        # use symmetry: dist[i, j] == dist[j, i]
        dist[i, j] = dist[j, i] = my_dist(X_arr[i], X_arr[j])
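The double loop above still makes on the order of N squared Python-level calls. A vectorized version is sketched below, assuming the distance is a weighted sum of squared differences; weighted_sq_dist_matrix is a hypothetical helper, and the small random array stands in for the real features.

```python
import numpy as np

def weighted_sq_dist_matrix(X, w):
    """Pairwise weighted squared distances, computed without Python loops."""
    Z = np.asarray(X, dtype=float) * np.sqrt(w)   # fold the weights into the features
    sq_norms = np.sum(Z ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, applied to all pairs at once
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)                     # clip tiny negatives from rounding

rng = np.random.default_rng(0)
X_demo = rng.random((6, 3))                       # made-up data
w = np.array([0.5, 0.25, 0.25])                   # hypothetical weights

D_fast = weighted_sq_dist_matrix(X_demo, w)
# Reference result from an explicit loop, for comparison only.
D_loop = np.array([[np.dot((a - b) ** 2, w) for b in X_demo] for a in X_demo])
```

Take np.sqrt of the result if the non-squared distance is wanted; the neighbor ordering is the same either way. The full matrix needs N x N floats, so memory rather than time may become the limit on large datasets.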

cv_scores = []  # start from an empty list so earlier runs are not mixed in
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed' 
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
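The precomputed-matrix search can be tried end to end on synthetic data before running it on the full dataset. This is a sketch under stated assumptions: X_demo, y_demo, the weights, and the three candidate k values are all made up, and the folds are fixed with random_state so every k is scored on the same splits.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.random((120, 4))                     # synthetic features
y_demo = (X_demo[:, 0] + 0.1 * rng.standard_normal(120) > 0.5).astype(int)
w = np.array([0.4, 0.2, 0.2, 0.2])                # hypothetical feature weights

# Weighted Euclidean distance matrix (N x N), built once and reused for every k.
Z = X_demo * np.sqrt(w)
D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
k_values = [1, 5, 15]
scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    # For a precomputed metric, cross_val_score slices both rows and columns of D.
    scores.append(cross_val_score(knn, D, y_demo, cv=cv, scoring="roc_auc").mean())

best_k = k_values[int(np.argmax(scores))]         # higher ROC AUC is better
```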

I couldn't test this, so let me know if something isn't right.
