How to use vectorization with NumPy arrays to calculate geodesic distance using the Geopy library for a large dataset?


Problem description

I am trying to calculate geodesic distances from a dataframe that has four columns of latitude and longitude data and around 3 million rows. I used apply with a lambda to do it, but it took 18 minutes to finish the task. Is there a way to use vectorization with NumPy arrays to speed up the calculation? Thank you for answering.

My code using apply and a lambda:

from geopy import distance

df['geo_dist'] = df.apply(lambda x: distance.distance(
                              (x['start_latitude'], x['start_longitude']),
                              (x['end_latitude'], x['end_longitude'])).miles, axis=1)

Update:

I am trying the code below, but it raises this error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). I would appreciate it if anyone can help.

df['geo_dist'] = distance.distance(
                          (df['start_latitude'].values, df['start_longitude'].values),
                          (df['end_latitude'].values, df['end_longitude'].values)).miles
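The same error can be reproduced without geopy. The sketch below assumes, as a likely explanation rather than a confirmed one, that the coordinate handling expects one number per coordinate and evaluates it in a boolean context. The helper `as_scalar_coordinate` is hypothetical and is not geopy's actual code.

```python
import numpy as np

# Made-up sample latitudes standing in for df['start_latitude'].values
latitudes = np.array([40.7, 34.1])

def as_scalar_coordinate(value):
    # Hypothetical stand-in for scalar-only coordinate handling (not geopy's code).
    # `value or 0.0` asks for the truth value of `value`, which NumPy refuses
    # to compute for an array holding more than one element.
    return float(value or 0.0)

try:
    as_scalar_coordinate(latitudes)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A single element behaves like an ordinary number and passes through.
scalar_result = as_scalar_coordinate(latitudes[0])

print(error_message)
print(scalar_result)
```

Passing whole columns therefore fails for any function written around single values, which is why a row-wise apply works while the array version does not.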

Recommended answer

I think you might consider using geopandas for this. It is an extension of pandas (and therefore built on NumPy) designed to do these types of calculations very quickly.

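Specifically, it has a method for calculating the element-wise distance between the points of two GeoSeries, each of which can be a column of a GeoDataFrame. I believe this method is vectorized rather than looping over rows in Python. Note that it measures planar distance in the units of the coordinate reference system, so with EPSG:4326 the result is in degrees, not miles; reproject to a suitable projected CRS if you need a length.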

It should look something like this, where you convert your data frame to a GeoDataFrame with (at least) two geometry columns that you can use for the origin and destination points. This should return a pandas Series of distances:

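import geopandas as gpd

# Build point geometries for both ends of each row (x = longitude, y = latitude).
origin = gpd.GeoSeries(gpd.points_from_xy(df['start_longitude'], df['start_latitude']),
                       index=df.index, crs='EPSG:4326')
destination = gpd.GeoSeries(gpd.points_from_xy(df['end_longitude'], df['end_latitude']),
                            index=df.index, crs='EPSG:4326')

gdf = gpd.GeoDataFrame(df, geometry=origin)
gdf['destination_geometry'] = destination

# Element-wise planar distance in CRS units (degrees for EPSG:4326).
distances = gdf.geometry.distance(gdf['destination_geometry'])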
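If a spherical approximation is acceptable, the haversine great-circle formula can be written directly on NumPy arrays. This is a minimal sketch and not what geopy does: geopy's distance.distance uses an ellipsoidal model, so the two results will differ slightly. The function name `haversine_miles` and the mean Earth radius of about 3958.8 miles are assumptions chosen for illustration.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between arrays of points given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=float)) for a in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    # Clip guards against rounding pushing `a` slightly outside [0, 1].
    return 2.0 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Made-up sample rows: row 0 spans one degree of longitude on the equator,
# row 1 has identical start and end points.
start_lat = np.array([0.0, 51.5])
start_lon = np.array([0.0, -0.1])
end_lat = np.array([0.0, 51.5])
end_lon = np.array([1.0, -0.1])

dist = haversine_miles(start_lat, start_lon, end_lat, end_lon)
print(dist)
```

With the data frame from the question, the call would take df['start_latitude'].values, df['start_longitude'].values, df['end_latitude'].values and df['end_longitude'].values, and the returned array can be assigned to df['geo_dist'] in one step.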
