Gapfilling missing data at specific latitude/longitude by nearest known neighbours

Problem description

I have a dataset of about 2 million rows, consisting of various properties at specific latitudes and longitudes. For each property, I have a valuation and a floor area. The valuations are complete but not all properties have floor areas.

I want to use some nearest-neighbours method to approximate the missing (NaN) values in the table. My software is written in Python, so this probably requires NumPy, pandas, SciPy or some combination.

I've had a look at using SciPy's cKDTree, as well as some distance approximation using a Haversine formula to calculate distances, but all the examples I've seen are about interpolating across a plane rather than for gap-filling missing data, and I'm a bit at a loss as to how to achieve this.
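For reference, the Haversine distance mentioned above can be written as a small vectorised NumPy function. This is only a sketch: the name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration, not something taken from the original data or code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars or NumPy arrays (broadcast against each other).
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Distance between two of the test properties (rows 0 and 9 of the sample data)
d = haversine_km(57.101474, -2.242851, 57.101893, -2.239883)
```

Because the function broadcasts, one property can be compared against whole columns at once, e.g. `haversine_km(lat0, lon0, df["lat"].to_numpy(), df["long"].to_numpy())`.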

As an example, here's the first few rows of what I've been using as test data (ratio is simply value/area):

lat       | long      | value | area  | ratio
----------|-----------|-------|-------|----------
57.101474 | -2.242851 | 12850 | 252.0 | 50.992063
57.102554 | -2.246308 | 14700 | 309.0 | 47.572816
57.100556 | -2.248342 | 25600 | 507.0 | 50.493097
57.101765 | -2.254688 | 28000 | 491.0 | 57.026477
57.097553 | -2.245483 | 5650  | 119.0 | 47.478992
57.098244 | -2.245768 | 43000 | 811.0 | 53.020962
57.098554 | -2.252504 | 46300 | 850.0 | 54.470588
57.102794 | -2.243454 | 7850  | 180.0 | 43.611111
57.101474 | -2.242851 | 26250 | NaN   | NaN
57.101893 | -2.239883 | 31000 | NaN   | NaN
57.101383 | -2.238955 | 28750 | NaN   | NaN
57.104578 | -2.235641 | 18500 | 327.0 | 56.574924
57.105424 | -2.234953 | 21950 | 406.0 | 54.064039
57.105516 | -2.233683 | 19600 | 408.0 | 48.039216

The properties themselves can be further grouped to get better relationships (this isn't part of the test data, but each property can be used for a different purpose, e.g. office, factory, shop).

I realise I can loop through this slowly, getting groups of properties by distance apart (testing each NaN property against the rest) but that would seem to be heartbreakingly glacial.

df.to_clipboard() output (first 15 rows):

    lat         long        value   area    ratio
0   57.101474   -2.242851   12850   252.0   50.992063
1   57.102554   -2.246308   14700   309.0   47.572816
2   57.100556   -2.248342   25600   507.0   50.493097
3   57.101765   -2.254688   28000   491.0   57.026477
4   57.097553   -2.245483   5650    119.0   47.478992
5   57.098244   -2.245768   43000   811.0   53.020962
6   57.098554   -2.252504   46300   850.0   54.470588
7   57.102794   -2.243454   7850    180.0   43.611111
8   57.101474   -2.242851   26250   NaN     NaN
9   57.101893   -2.239883   31000   NaN     NaN
10  57.101383   -2.238955   28750   NaN     NaN
11  57.104578   -2.235641   18500   327.0   56.574924
12  57.105424   -2.234953   21950   406.0   54.064039
13  57.105516   -2.233683   19600   408.0   48.039216

Solution

If you are open to using libraries, you can build a distance matrix.

Assuming df is your main DataFrame:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

def find_closest(x, df, known_index):
    # Drop the row's own entry so it is not matched to itself
    d = x.drop(x.name).to_dict()
    # Sort the remaining rows by distance, nearest first
    v = sorted(d, key=lambda k: d[k])
    # Return the ratio of the closest row with a non-NaN area, else NaN
    for i in v:
        if i in known_index:
            return df.loc[i].ratio
    return np.nan

# Pairwise distances, labelled with df's own index so that lookups line up
df_matrix_distance = pd.DataFrame(euclidean_distances(df[["lat", "long"]]), index=df.index, columns=df.index)
# Rows with a missing area, and the index of rows with a known area
df_nan = df[df.area.isnull()]
known_index = df[~df.area.isnull()].index
# Closest known ratio for each row with a missing area
res = df_matrix_distance.loc[df_nan.index].apply(lambda x: find_closest(x, df, known_index), axis=1).to_dict()
# Fill in the ratio, then derive the area as value / ratio
for k, v in res.items():
    df.loc[k, "ratio"] = v
    df.loc[k, "area"] = df.loc[k, "value"] / df.loc[k, "ratio"]

The result

    lat         long        value   area    ratio
0   57.101474   -2.242851   12850   252.0   50.992063
1   57.102554   -2.246308   14700   309.0   47.572816
2   57.100556   -2.248342   25600   507.0   50.493097
3   57.101765   -2.254688   28000   491.0   57.026477
4   57.097553   -2.245483   5650    119.0   47.478992
5   57.098244   -2.245768   43000   811.0   53.020962
6   57.098554   -2.252504   46300   850.0   54.470588
7   57.102794   -2.243454   7850    180.0   43.611111
8   57.101474   -2.242851   26250   514.0   50.99206349
9   57.101893   -2.239883   31000   607.0   51.00502513
10  57.101383   -2.238955   28750   563.0   51.00502513
11  57.104578   -2.235641   18500   327.0   56.574924
12  57.105424   -2.234953   21950   406.0   54.064039
13  57.105516   -2.233683   19600   408.0   48.039216
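A full pairwise distance matrix grows with the square of the row count, so it is unlikely to fit in memory for the roughly 2 million rows mentioned in the question. A possible alternative, sketched here with the `cKDTree` the question mentions, is to index only the rows with a known area and query the tree for the rows without one. The function name `fill_area_from_nearest`, the five-row sample frame and the flat-earth scaling of longitude by cos(mean latitude) are all assumptions made for illustration. The scaling should be reasonable over a small area such as one city, but would not be over large distances.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def fill_area_from_nearest(df):
    """Return a copy of df with missing `area` filled from the `ratio`
    of the nearest row (by lat/long) that has a known area."""
    out = df.copy()
    known = out["area"].notna().to_numpy()
    if known.all() or not known.any():
        return out  # nothing to fill, or nothing to fill from

    # Assumption: local flat-earth approximation. Scaling longitude by
    # cos(mean latitude) makes a unit roughly the same length on both axes.
    scale = np.cos(np.radians(out["lat"].mean()))
    xy = np.column_stack([out["lat"].to_numpy(), out["long"].to_numpy() * scale])

    tree = cKDTree(xy[known])                  # index only rows with a known area
    _, nearest = tree.query(xy[~known], k=1)   # positions within the known rows

    ratio = out.loc[known, "ratio"].to_numpy()[nearest]
    out.loc[~known, "ratio"] = ratio
    out.loc[~known, "area"] = out.loc[~known, "value"].to_numpy() / ratio
    return out


# Small sample built from a few rows of the test data in the question
df = pd.DataFrame({
    "lat":   [57.101474, 57.102794, 57.101474, 57.101893, 57.104578],
    "long":  [-2.242851, -2.243454, -2.242851, -2.239883, -2.235641],
    "value": [12850, 7850, 26250, 31000, 18500],
    "area":  [252.0, 180.0, np.nan, np.nan, 327.0],
})
df["ratio"] = df["value"] / df["area"]
filled = fill_area_from_nearest(df)
```

Passing a larger `k` to `tree.query` and averaging the neighbours' ratios would be one way to smooth out a single unrepresentative neighbour. Running the function once per property type (office, factory, shop) would be one way to use the grouping described in the question.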
