Conditional Nearest Neighbor in Python
Question
I'm trying to do some nearest-neighbour analysis in Python using Pandas/NumPy/SciPy, and having tried a few different approaches, I'm stumped.
I have two dataframes as follows:
df1
Lon1 Lat1 Type
10 10 A
50 50 A
20 20 B
df2
Lon2 Lat2 Type Data-1 Data-2
11 11 A Eggs Bacon
51 51 A Nuts Bread
61 61 A Beef Lamb
21 21 B Chips Chicken
31 31 B Sauce Pasta
71 71 B Rice Oats
81 81 B Beans Peas
For each row of df1, I'm trying to identify the 2 nearest neighbours in df2 (based on the Lon/Lat values, using Euclidean distance) and then merge the corresponding Data-1 and Data-2 values onto df1 so it looks like this:
Lon1 Lat1 Type Data-1a Data-2a Data-1b Data-2b
10 10 A Eggs Bacon Nuts Bread
50 50 A Nuts Bread Beef Lamb
20 20 B Chips Chicken Sauce Pasta
I've tried both long-form and wide-form approaches and am leaning toward using cKDTree from scipy. However, is there a way to do this so that it only looks at rows with the matching Type?
Thanks.
**Edit**
I've made some progress as follows:
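from scipy.spatial import cKDTree

Typelist = df2['Type'].unique().tolist()
# One sub-frame of df2 per Type, keyed by the Type value
df_dict = {'{}'.format(x): df2[(df2['Type'] == x)] for x in Typelist}

def treefunc(row):
    # Only Type 'A' is handled here; rows of any other Type return None
    if row['Type'] == 'A':
        row_type = row['Type']
        location = row[['Lon1','Lat1']].values.astype(float)
        tree = cKDTree(df_dict[row_type][['Lon2','Lat2']].values)
        dists, indexes = tree.query(location, k=2)
        return dists,indexes

dftest = df1.apply(treefunc,axis=1)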
This gives me the distances and indexes of the 2 nearest neighbours for each row, which is great! However, I still have some issues:
- I tried to test the row['Type'] value for membership of Typelist using .isin, but this didn't work. Are there any other ways to do this?
- How can I get Pandas to create new columns for the dists and indexes produced by the k-d tree?
- How can I then use the indexes to return Data-1 and Data-2?
Thanks.
Recommended answer
This is pretty messy, but I think it might be a good starting point. I've used scikit-learn's implementation, only because I'm more comfortable with it (though I'm very green myself).
import pandas as pd
from io import StringIO

s1 = StringIO(u'''Lon2,Lat2,Type,Data-1,Data-2
11,11,A,Eggs,Bacon
51,51,A,Nuts,Bread
61,61,A,Beef,Lamb
21,21,B,Chips,Chicken
31,31,B,Sauce,Pasta
71,71,B,Rice,Oats
81,81,B,Beans,Peas''')
df2 = pd.read_csv(s1)

# Start here
from sklearn.neighbors import NearestNeighbors

frames = []
for i in pd.unique(df2.Type):
    # Restrict the search to rows of a single Type
    dftype = df2[df2['Type'] == i].reset_index(drop=True)
    X = dftype[['Lon2','Lat2']].values
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='kd_tree').fit(X)
    # X is queried against itself, so column 0 is each point itself
    # and column 1 is its nearest other point of the same Type
    distances, indices = nbrs.kneighbors(X)
    out = dftype.iloc[indices[:, 0]].reset_index(drop=True)
    out['Data-1b'] = dftype['Data-1'].values[indices[:, 1]]
    out['Data-2b'] = dftype['Data-2'].values[indices[:, 1]]
    out['Distance'] = distances[:, 1]
    frames.append(out)

# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead
dfNN = pd.concat(frames, ignore_index=True)
dfNN = dfNN[['Lat2', 'Lon2', 'Type', 'Data-1', 'Data-2','Data-1b','Data-2b','Distance']]
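The code above finds neighbours within df2 itself. As a rough sketch of how the same per-Type idea could be pointed at the original question (querying each df1 row against the df2 rows of the same Type with cKDTree), something like the following might work. The helper name `nearest_by_type`, the `Distance-a`/`Distance-b` column names, and the choice to raise an error when a Type has fewer than k candidates are my own assumptions, not part of the question or of any library.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Sample data copied from the question
df1 = pd.DataFrame({
    'Lon1': [10, 50, 20],
    'Lat1': [10, 50, 20],
    'Type': ['A', 'A', 'B'],
})
df2 = pd.DataFrame({
    'Lon2': [11, 51, 61, 21, 31, 71, 81],
    'Lat2': [11, 51, 61, 21, 31, 71, 81],
    'Type': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Data-1': ['Eggs', 'Nuts', 'Beef', 'Chips', 'Sauce', 'Rice', 'Beans'],
    'Data-2': ['Bacon', 'Bread', 'Lamb', 'Chicken', 'Pasta', 'Oats', 'Peas'],
})


def nearest_by_type(left, right, k=2):
    """Hypothetical helper: attach the k nearest same-Type rows of `right` to `left`."""
    suffixes = 'abcdefghijklmnopqrstuvwxyz'
    parts = []
    # Grouping by Type means each tree only ever sees rows of the matching Type
    for type_value, group in left.groupby('Type', sort=False):
        candidates = right[right['Type'] == type_value].reset_index(drop=True)
        if len(candidates) < k:
            # Assumption: treat too few candidates as an error rather than padding
            raise ValueError(f'Type {type_value!r} has fewer than {k} rows to match')
        tree = cKDTree(candidates[['Lon2', 'Lat2']].to_numpy(dtype=float))
        dists, idx = tree.query(group[['Lon1', 'Lat1']].to_numpy(dtype=float), k=k)
        # query returns 1-D arrays when k=1, so force a (rows, k) shape
        dists = dists.reshape(len(group), k)
        idx = idx.reshape(len(group), k)
        out = group.copy()
        for n in range(k):
            s = suffixes[n]
            # idx holds positions into `candidates`, so plain array indexing works
            out['Data-1' + s] = candidates['Data-1'].to_numpy()[idx[:, n]]
            out['Data-2' + s] = candidates['Data-2'].to_numpy()[idx[:, n]]
            out['Distance-' + s] = dists[:, n]
        parts.append(out)
    # Restore the original row order of `left`
    return pd.concat(parts).sort_index()


result = nearest_by_type(df1, df2)
print(result)
```

On the sample data this gives the Data-1a/Data-2a/Data-1b/Data-2b values from the desired table in the question, plus one distance column per neighbour. Grouping also sidesteps the membership test: `.isin` is a Series method, so for a single value such as `row['Type']` a plain `row['Type'] in Typelist` is the usual check.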