KNN: give more weight to specific features in the distance
Problem description
I'm using the Kobe Bryant dataset and want to predict shot_made_flag with a KNN regressor.
I used game_date to extract year and month features:
import re

# convert season to its starting year (the digits before the hyphen)
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile(r'(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
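As an alternative sketch, the same year and month columns can be derived with the pandas datetime accessor instead of regular expressions. The toy dataframe below and the date strings in it are assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for kobe_data_encoded (values are made up for illustration).
df = pd.DataFrame({
    "season": ["2000-01", "2001-02"],
    "game_date": ["2000-10-31", "2001-11-02"],
})

# Leading digits before the hyphen become the season's starting year.
df["season"] = df["season"].str.extract(r"(\d+)-", expand=False).astype(int)

# Parse the dates once, then read year and month from the parsed values.
dates = pd.to_datetime(df["game_date"])
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df = df.drop(columns=["game_date"])
```

Parsing the column once avoids recompiling a regular expression for every row.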
I want the season, year, and month features to carry more weight in the distance function, so that events closer in date to the current event are closer neighbors while still keeping reasonable distances to other potential data points. For example, I don't want an event on the same day to be the nearest neighbor just because of the date features; the distance should still take other features, such as shot_range, into account.
To give them more weight, I tried using the metric argument with a custom distance function, but the function receives plain numpy arrays without the pandas column information, so I'm not sure how to implement what I'm trying to do.
Using larger weights for the date features, I search for the optimal k with 10-fold cross-validation over k in [1, 100]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# scale the listed features to [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
                'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features

# rows without a label are the ones to predict; the labeled rows are used for CV
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']

cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = list(range(1, 101))  # k = 1..100
cv_scores = []

# give the date features 5x the weight of the other features
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
        X.columns.get_loc("year"),
        X.columns.get_loc("month")
        ]] = 5
weight = weight/weight.sum()  # normalize weights

def my_distance(x, y):
    # weighted squared Euclidean distance between two 1D arrays
    dist = ((x-y)**2)
    return np.dot(dist, weight)

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))

# optimal k: a higher ROC AUC is better, so take the maximum score
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
This runs really slowly. Any idea how to make it faster? The idea behind the weighted features is to find neighbors closer in date to the data point, to avoid data leakage, and to use cross-validation to find the optimal k.
Recommended answer
First, prepare a 1D numpy weight array that specifies a weight for each feature. You could do something like:
weight = np.ones((M,))  # M is the number of features
weight[[1, 7, 10]] = 2  # increase the weight of the features at (0-based) indexes 1, 7 and 10
weight = weight/weight.sum()  # normalize weights so they sum to 1
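You can use kobe_data_encoded.columns to find the indexes of the season, year, and month features in your dataframe and replace the hard-coded indexes above. Look them up on the feature matrix you actually pass to the model (after dropping shot_made_flag), so the positions match the arrays the metric receives.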
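One way to look up those positions by name is sketched below. The toy dataframe and its column order are assumptions for illustration; only the lookup pattern is the point.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature matrix X (column order is an assumption).
X = pd.DataFrame({
    "loc_x": [0.1, 0.5],
    "season": [2000, 2001],
    "shot_distance": [0.3, 0.7],
    "year": [2000, 2001],
    "month": [10, 11],
})

date_cols = ["season", "year", "month"]
# get_indexer returns the integer position of each name (-1 if missing).
date_idx = X.columns.get_indexer(date_cols)
if (date_idx < 0).any():
    raise KeyError("some date columns are missing from X")

weight = np.ones(X.shape[1])
weight[date_idx] = 2            # up-weight the date features
weight = weight / weight.sum()  # normalize so the weights sum to 1
```

Looking the positions up by name keeps the weights correct if columns are later added or reordered.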
Now define a distance function which, per the scikit-learn guideline, must take two 1D numpy arrays.
def my_dist(x, y):
    global weight  # 1D array, same shape as x or y
    dist = ((x-y)**2)  # 1D array, same shape as x or y
    return np.dot(dist, weight)  # a scalar float
And initialize KNeighborsRegressor as:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(metric=my_dist)
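A possible shortcut, offered as a sketch rather than as part of the original answer: because my_dist is a weighted squared Euclidean distance, scaling each feature column by the square root of its weight and then using ordinary Euclidean distance yields the same neighbor ranking. The random data and weights below are made up to check that identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))                    # toy feature matrix (assumption)
weight = np.array([1.0, 5.0, 1.0, 5.0])   # made-up weights
weight = weight / weight.sum()

def my_dist(x, y):
    # weighted squared Euclidean distance, as in the answer
    return np.dot((x - y) ** 2, weight)

# Scale each column by sqrt(weight); the plain squared Euclidean distance
# on the scaled data then equals the weighted squared distance.
X_scaled = X * np.sqrt(weight)

all_match = True
for i in range(len(X)):
    for j in range(len(X)):
        weighted = my_dist(X[i], X[j])
        plain = np.sum((X_scaled[i] - X_scaled[j]) ** 2)
        all_match = all_match and bool(np.isclose(weighted, plain))
```

Under that identity, X_scaled could be passed to KNeighborsRegressor or KNeighborsClassifier with the default metric, which lets scikit-learn use its built-in distance routines instead of calling a Python function for every pair.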
For efficiency, you can precompute the distance matrix and reuse it in KNN. This should give a significant speedup by reducing the number of calls to my_dist, since this non-vectorized custom Python distance function is quite slow. So now:
X_arr = np.asarray(X, dtype=float)  # plain 2D array, so X_arr[i] is row i (X[i] on a DataFrame would select a column)
n = len(X_arr)
dist = np.zeros((n, n))  # N x N distance matrix
for i in range(n):  # you can halve the work by using dist[i, j] == dist[j, i]
    for j in range(n):
        dist[i, j] = my_dist(X_arr[i], X_arr[j])

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')  # note: metric='precomputed'
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc')))  # note: passing dist instead of X
I couldn't test it, so let me know if something isn't right.
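If the double loop over all pairs is itself too slow, the full matrix can be built with numpy array operations instead. This is a sketch under the assumption that the features fit in memory as a dense float array; the helper name weighted_sq_dist_matrix, the toy data, and the weights are made up, and the result is checked against the pairwise definition.

```python
import numpy as np

rng = np.random.default_rng(1)
X_arr = rng.random((50, 5))                     # toy feature matrix (assumption)
weight = np.array([5.0, 1.0, 1.0, 5.0, 1.0])    # made-up weights
weight = weight / weight.sum()

def weighted_sq_dist_matrix(A, w):
    """All-pairs weighted squared Euclidean distances for the rows of A."""
    A_s = A * np.sqrt(w)                        # fold the weights into the data
    sq_norms = np.sum(A_s ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A_s @ A_s.T)
    np.maximum(D, 0.0, out=D)                   # clip tiny negative round-off
    np.fill_diagonal(D, 0.0)                    # a point is at distance 0 from itself
    return D

dist = weighted_sq_dist_matrix(X_arr, weight)

# Reference: the same quantity via the explicit pairwise formula.
ref = np.array([[np.dot((a - b) ** 2, weight) for b in X_arr] for a in X_arr])
max_err = np.abs(dist - ref).max()
```

The resulting dist can be passed to a KNN estimator created with metric='precomputed' in the same way as a loop-built matrix. Note that an N x N matrix needs memory that grows quadratically with the number of rows.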