Gapfilling missing data at specific latitude/longitude by nearest known neighbours
Problem description

I have a dataset of about 2 million rows, consisting of various properties at specific latitudes and longitudes. For each property, I have a valuation and a floor area. The valuations are complete, but not all properties have floor areas.

I want to interpolate using some nearest-neighbours method to approximate the specific missing NaN values in the table. My software is written in Python, so this probably requires using NumPy, Pandas, SciPy or some combination.

I've had a look at using SciPy's cKDTree, as well as some distance approximation using a Haversine formula to calculate distances, but all the examples I've seen are about interpolating across a plane rather than gap-filling missing data, and I'm a bit at a loss as to how to achieve this.

As an example, here's the first few rows of what I've been using as test data (ratio is simply value/area):

lat       | long      | value | area  | ratio
----------|-----------|-------|-------|----------
57.101474 | -2.242851 | 12850 | 252.0 | 50.992063
57.102554 | -2.246308 | 14700 | 309.0 | 47.572816
57.100556 | -2.248342 | 25600 | 507.0 | 50.493097
57.101765 | -2.254688 | 28000 | 491.0 | 57.026477
57.097553 | -2.245483 | 5650  | 119.0 | 47.478992
57.098244 | -2.245768 | 43000 | 811.0 | 53.020962
57.098554 | -2.252504 | 46300 | 850.0 | 54.470588
57.102794 | -2.243454 | 7850  | 180.0 | 43.611111
57.101474 | -2.242851 | 26250 | NaN   | NaN
57.101893 | -2.239883 | 31000 | NaN   | NaN
57.101383 | -2.238955 | 28750 | NaN   | NaN
57.104578 | -2.235641 | 18500 | 327.0 | 56.574924
57.105424 | -2.234953 | 21950 | 406.0 | 54.064039
57.105516 | -2.233683 | 19600 | 408.0 | 48.039216

The properties themselves can be further grouped to get better relationships (this isn't part of the test data, but each property can be used for a different purpose, e.g. office, factory, shop).

I realise I can loop through this slowly, getting groups of properties by distance apart (testing each property against the rest), but that would seem to be heartbreakingly glacial.

df.to_clipboard() output (first 15 rows):

          lat       long  value   area  ratio
0   57.101474  -2.242851  12850  252.0  50.992063
1   57.102554  -2.246308  14700  309.0  47.572816
2   57.100556  -2.248342  25600  507.0  50.493097
3   57.101765  -2.254688  28000  491.0  57.026477
4   57.097553  -2.245483   5650  119.0  47.478992
5   57.098244  -2.245768  43000  811.0  53.020962
6   57.098554  -2.252504  46300  850.0  54.470588
7   57.102794  -2.243454   7850  180.0  43.611111
8   57.101474  -2.242851  26250    NaN  NaN
9   57.101893  -2.239883  31000    NaN  NaN
10  57.101383  -2.238955  28750    NaN  NaN
11  57.104578  -2.235641  18500  327.0  56.574924
12  57.105424  -2.234953  21950  406.0  54.064039
13  57.105516  -2.233683  19600  408.0  48.039216

Answer

If you are open to libraries, you can use a distance matrix. Assuming df is your main dataframe:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances


def find_closest(x, df):
    # Drop the row's distance to itself
    d = x.drop(x.name).to_dict()
    # Sort the other rows by increasing distance
    v = sorted(d, key=lambda k: d[k])
    # Return the ratio of the closest row with a known area, else NaN
    known = df.index[df.area.notnull()]
    for i in v:
        if i in known:
            return df.loc[i].ratio
    return np.nan


# Pairwise distances between all rows (assumes df has a default 0..n-1 index)
df_matrix_distance = pd.DataFrame(euclidean_distances(df[["lat", "long"]]))

# Get the rows with a null area
df_nan = df[df.area.isnull()]

# Get the nearest known ratio for each of those rows
res = df_matrix_distance.loc[df_nan.index].apply(lambda x: find_closest(x, df), axis=1).to_dict()

# Fill the ratio, then derive the area from value / ratio
for k, v in res.items():
    df.loc[k, "ratio"] = v
    df.loc[k, "area"] = df.loc[k, "value"] / df.loc[k, "ratio"]

The result:

          lat       long  value   area  ratio
0   57.101474  -2.242851  12850  252.0  50.992063
1   57.102554  -2.246308  14700  309.0  47.572816
2   57.100556  -2.248342  25600  507.0  50.493097
3   57.101765  -2.254688  28000  491.0  57.026477
4   57.097553  -2.245483   5650  119.0  47.478992
5   57.098244  -2.245768  43000  811.0  53.020962
6   57.098554  -2.252504  46300  850.0  54.470588
7   57.102794  -2.243454   7850  180.0  43.611111
8   57.101474  -2.242851  26250  514.0  50.99206349
9   57.101893  -2.239883  31000  607.0  51.00502513
10  57.101383  -2.238955  28750  563.0  51.00502513
11  57.104578  -2.235641  18500  327.0  56.574924
12  57.105424  -2.234953  21950  406.0  54.064039
13  57.105516  -2.233683  19600  408.0  48.039216
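
A note on scale: a full pairwise distance matrix has n × n entries, so for the roughly 2 million rows mentioned in the question it would hold about 4 × 10^12 distances and is unlikely to fit in memory. Below is a minimal sketch of a tree-based alternative using SciPy's cKDTree, which the question mentions. It is not part of the original answer. The function name fill_area_from_neighbours, the parameter k, and the choice of the median ratio of the k nearest known rows are all assumptions made for illustration. Longitude is scaled by the cosine of the mean latitude, which is a planar approximation rather than a true Haversine distance, and it is only reasonable when the data covers a small region.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def fill_area_from_neighbours(df, k=3):
    """Fill missing 'area' from the median value/area ratio of the
    k nearest rows that have a known area (hypothetical helper)."""
    out = df.copy()
    known = out["area"].notna().to_numpy()
    missing = ~known
    if not missing.any() or not known.any():
        return out

    # Approximate planar coordinates: scale longitude by cos(mean latitude)
    # so that both axes measure roughly the same ground distance.
    lat = np.radians(out["lat"].to_numpy())
    lon = np.radians(out["long"].to_numpy())
    xy = np.column_stack([lat, lon * np.cos(lat.mean())])

    # Index only the rows with a known area, then query the rows to fill.
    tree = cKDTree(xy[known])
    k = min(k, int(known.sum()))
    _, idx = tree.query(xy[missing], k=k)
    idx = idx.reshape(len(idx), -1)  # k=1 gives a 1-D result; make it 2-D

    # Ratio is value/area; it is NaN wherever the area is missing.
    out["ratio"] = out["value"] / out["area"]
    known_ratio = out["ratio"].to_numpy()[known]
    ratio = np.median(known_ratio[idx], axis=1)

    out.loc[missing, "ratio"] = ratio
    out.loc[missing, "area"] = out.loc[missing, "value"].to_numpy() / ratio
    return out


# A few rows taken from the test data above (NaN marks a missing area).
sample = pd.DataFrame({
    "lat": [57.101474, 57.102554, 57.100556, 57.101474, 57.101893],
    "long": [-2.242851, -2.246308, -2.248342, -2.242851, -2.239883],
    "value": [12850, 14700, 25600, 26250, 31000],
    "area": [252.0, 309.0, 507.0, np.nan, np.nan],
})
filled = fill_area_from_neighbours(sample, k=1)
print(filled)
```

With k=1 this mirrors the nearest-known-neighbour idea of the answer above; a larger k smooths over individual outliers. To group by property use, as the question suggests, one option would be to call the function once per group, assuming a column identifying the use exists in the real data.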