KNN: give more weight to specific features in the distance


Question

I'm using the Kobe Bryant dataset. I want to predict shot_made_flag with a KNN regressor.

I've used game_date to extract year and month features:

import re

# convert season to its starting year (the digits before the hyphen)
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.findall(r'(\d+)-', x)[0]))

# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.findall(r'(\d{4})', x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.findall(r'-(\d+)-', x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
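As an aside, the same extraction can be done without per-row regex calls. The sketch below is only an illustration on a made-up two-row frame: the column names follow the question, but the values and the assumed date format (YYYY-MM-DD) are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame with the question's column names; the values are made up.
df = pd.DataFrame({
    "season": ["2000-01", "2015-16"],
    "game_date": ["2000-10-31", "2016-04-13"],
})

# Season start year: the digits before the hyphen.
df["season"] = df["season"].str.extract(r"(\d+)-", expand=False).astype(int)

# Parse the date once, then read year and month from the .dt accessor.
# Assumption: dates are formatted as YYYY-MM-DD.
dates = pd.to_datetime(df["game_date"], format="%Y-%m-%d")
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df = df.drop(columns=["game_date"])
```

Parsing the column once with pd.to_datetime also fails loudly on malformed dates, which the regex version would only surface as an IndexError.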

I want to give the season, year and month features more weight in the distance function, so that events closer in date to the current event become closer neighbors while still keeping reasonable distances to other potential data points. For example, I don't want an event on the same day to be the closest neighbor just because of the date features; the other features, such as shot_range, should still be taken into account.
To give the date features more weight, I tried the metric argument with a custom distance function, but the function only receives numpy arrays, without the pandas column information, so I'm not sure how to implement what I'm trying to do.

Using larger weights for the date features, and 10-fold cross-validation over k in [1, 100] to find the optimal k:

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
                'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features

not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)

neighbors = list(range(1, 101))  # k = 1..100 inclusive
cv_scores = []

weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
 X.columns.get_loc("year"),
 X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum()  #Normalize weights

def my_distance(x, y):
    dist = ((x-y)**2)
    return np.dot(dist, weight)

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))

# optimal k: ROC AUC is a score, so the best k has the highest value
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()

This runs really slowly. Any idea how to make it faster? The idea behind the weighted features is to find neighbors that are closer in date to the data point, to avoid data leakage, and to use cross-validation to find the optimal k.

Recommended answer

First, you have to prepare a 1D numpy weight array that specifies a weight for each feature. You could do something like:

weight = np.ones((M,))  # M is no of features
weight[[1,7,10]] = 2    # Increase weight of the features at (0-based) indices 1, 7 and 10
weight = weight/weight.sum()  #Normalize weights

You can use kobe_data_encoded.columns to find the indexes of the season, year and month features in your dataframe and use them in the second line above.
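One way to look up those positions is sketched here on a hypothetical five-column frame (the column order and the zero-filled data are made up for illustration). It uses pandas' Index.get_indexer, which returns -1 for any name it cannot find, so a missing column is easy to detect.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame; only the column order matters here.
X = pd.DataFrame(np.zeros((3, 5)),
                 columns=["loc_x", "season", "shot_distance", "year", "month"])

date_cols = ["season", "year", "month"]
date_idx = X.columns.get_indexer(date_cols)  # positions of the date columns
assert (date_idx >= 0).all(), "a date column is missing from X"

weight = np.ones(X.shape[1])
weight[date_idx] = 2            # same factor as in the example above
weight = weight / weight.sum()  # normalize so the weights sum to 1
```

Because the weights are tied to column positions, they have to be rebuilt whenever columns are added, dropped or reordered.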

Now define a distance function, which according to the guidelines has to take two 1D numpy arrays.

def my_dist(x,y):
    global weight     #1D array, same shape as x or y
    dist = ((x-y)**2) #1D array, same shape as x or y
    return np.dot(dist,weight)  # a scalar float

And initialize KNeighborsRegressor as:

knn = KNeighborsRegressor(metric=my_dist)
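A possible alternative to the callable metric, not part of the original answer: since the weighted squared distance sum(w_i * (x_i - y_i)^2) equals the ordinary squared Euclidean distance between features scaled by sqrt(w_i), the columns can be scaled once and the default metric used. The sketch below checks this on made-up random data; the data, the weight values and algorithm="brute" (chosen so the comparison does not depend on tree-based search) are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 4))                  # made-up data: 60 samples, 4 features
y = rng.integers(0, 2, size=60)          # made-up binary labels
weight = np.array([5.0, 1.0, 1.0, 1.0])  # first feature weighted more heavily
weight = weight / weight.sum()

def my_dist(a, b):
    # Weighted squared distance, as defined in the answer.
    return np.dot((a - b) ** 2, weight)

# Callable metric: one Python call per pair of points, hence slow.
knn_callable = KNeighborsClassifier(
    n_neighbors=3, metric=my_dist, algorithm="brute").fit(X, y)

# Scale each column by sqrt(weight) and use the default Euclidean metric.
X_scaled = X * np.sqrt(weight)
knn_scaled = KNeighborsClassifier(
    n_neighbors=3, algorithm="brute").fit(X_scaled, y)

_, idx_callable = knn_callable.kneighbors(X[:10])
_, idx_scaled = knn_scaled.kneighbors(X_scaled[:10])
same_neighbors = np.array_equal(idx_callable, idx_scaled)
```

The reported distances differ (squared versus not squared), but the neighbor ranking is the same, so the scaled version can use scikit-learn's built-in distance code instead of a Python callback.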

To make this more efficient, you can precompute the distance matrix and reuse it in KNN. This should give a significant speedup by reducing the number of calls to my_dist, since this non-vectorized custom Python distance function is quite slow. So now:

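X_arr = X.to_numpy(dtype=float)   # plain array: X[i] on a DataFrame selects a column, not a row
dist = np.zeros((len(X_arr), len(X_arr)))  # Computing the N x N distance matrix
for i in range(len(X_arr)):       # You can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(len(X_arr)):
        dist[i,j] = my_dist(X_arr[i], X_arr[j])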

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed' 
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
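The double loop that fills the distance matrix still calls my_dist N*N times. A vectorized sketch of the same matrix, under the assumption that the metric is the weighted squared distance defined above, is shown here on made-up data. It uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b on features scaled by sqrt(weight).

```python
import numpy as np

rng = np.random.default_rng(0)
X_arr = rng.random((50, 4))              # made-up data: 50 samples, 4 features
weight = np.array([5.0, 1.0, 1.0, 1.0])
weight = weight / weight.sum()

# Scale features so plain squared Euclidean distance equals the weighted one.
Xs = X_arr * np.sqrt(weight)

# Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
sq_norms = np.einsum("ij,ij->i", Xs, Xs)
dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Xs @ Xs.T)
np.maximum(dist, 0.0, out=dist)          # clip tiny negatives from round-off

# Reference: the explicit double loop, for comparison only.
ref = np.zeros_like(dist)
for i in range(len(X_arr)):
    for j in range(len(X_arr)):
        ref[i, j] = np.dot((X_arr[i] - X_arr[j]) ** 2, weight)
```

Either way, the full matrix holds N*N float64 values (8*N*N bytes), so for large N it may not fit in memory, and scaling the features and using the default metric may be the more practical route.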

I couldn't test it, so let me know if something isn't right.
