Vectorizing a function in pandas
Problem description
I have a dataframe that contains a list of lat/lon coordinates:
import pandas as pd

d = {'Provider ID': {0: '10001',
                     1: '10005',
                     2: '10006',
                     3: '10007',
                     4: '10008',
                     5: '10011',
                     6: '10012',
                     7: '10016',
                     8: '10018',
                     9: '10019'},
     'latitude': {0: '31.215379379000467',
                  1: '34.22133455500045',
                  2: '34.795039606000444',
                  3: '31.292159523000464',
                  4: '31.69311635000048',
                  5: '33.595265517000485',
                  6: '34.44060759100046',
                  7: '33.254429322000476',
                  8: '33.50314015000049',
                  9: '34.74643089500046'},
     'longitude': {0: '-85.36146587999968',
                   1: '-86.15937514799964',
                   2: '-87.68507485299966',
                   3: '-86.25539902199966',
                   4: '-86.26549483099967',
                   5: '-86.66531866799966',
                   6: '-85.75726760699968',
                   7: '-86.81407933399964',
                   8: '-86.80242858299965',
                   9: '-87.69893502799965'}}
df = pd.DataFrame(d)
My goal is to use the haversine function to compute the distance in km between every pair of items:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # 6367 km is used as the radius of the Earth (the mean radius is about 6371 km)
    km = 6367 * c
    return km
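Because the function above is built on the `math` module, it works on one pair of points at a time. As a minimal sketch, assuming NumPy is available, the same formula can be written with NumPy functions so that it accepts whole arrays. The name `haversine_np` and the `radius_km` parameter are hypothetical, chosen for illustration, and the coordinates in the usage lines are made up.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2, radius_km=6367):
    """Element-wise great-circle distance in km; inputs are decimal degrees."""
    # np.radians works on scalars and arrays alike
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# One reference point against an array of points on the equator (illustrative values).
lons = np.array([0.0, 90.0, 180.0])
lats = np.array([0.0, 0.0, 0.0])
quarter_and_half = haversine_np(0.0, 0.0, lons, lats)
print(quarter_and_half)
```

The scalar arguments broadcast against the arrays, so a single call returns the distance from the reference point to every point in `lons`/`lats`.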
My goal is to get a dataframe that looks like the result_df below, where each value is the distance between a pair of provider IDs:
result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])
I can do this in a loop, but it's terribly slow. I'm looking for help converting it to a vectorized method:
for first_hospital_coordinates in result_df.columns:
    for second_hospital_coordinates in result_df.index:
        # look up each provider's coordinates as floats
        L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values[0]
        O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values[0]
        L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values[0]
        O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values[0]
        distance = haversine(O1, L1, O2, L2)
        # row = second provider, column = first provider
        result_df.loc[second_hospital_coordinates, first_hospital_coordinates] = distance
To vectorize this code, you need to operate on the complete dataframe rather than on individual latitudes and longitudes. Here is my attempt. It uses the dataframe indexed by Provider ID and a new function, h2, which computes the distances from one provider p to every row at once:
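import numpy as np

def h2(df, p):
    # convert every coordinate from degrees to radians
    inrad = np.radians(df)
    dlon = inrad.longitude - inrad.longitude[p]
    dlat = inrad.latitude - inrad.latitude[p]
    # latitude of the reference provider p, repeated for every row
    lat1 = pd.Series(index=df.index, data=[df.latitude[p] for i in range(len(df.index))])
    # NOTE: the cosines here receive latitudes in degrees (df), not radians (inrad)
    a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
    # NOTE: the haversine function in the question uses 2 * asin(sqrt(a)) at this step
    c = 2 * 1/np.sin(np.sqrt(a))
    km = 6367 * c
    return km

df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index=df.index, columns=df.index)
# fill one column per reference provider; each call is vectorized over all rows
for c in df2.columns:
    df2[c] = h2(df, c)
print(df2)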
This should yield the following (I can't be sure that I have the correct answer; my goal was to vectorize the code):
Provider ID         10001         10005         10006         10007  \
Provider ID
10001                 inf  5.021936e+05  5.270062e+05  1.649088e+06
10005        5.021936e+05           inf  9.294868e+05  4.985233e+05
10006        5.270062e+05  9.294868e+05           inf  4.548412e+05
10007        1.649088e+06  4.985233e+05  4.548412e+05           inf
10008        1.460299e+06  5.777248e+05  5.246954e+05  3.638231e+06
10011        6.723581e+05  2.004199e+06  1.027439e+06  6.394402e+05
10012        4.559090e+05  3.265536e+06  7.573411e+05  4.694125e+05
10016        7.680036e+05  1.429573e+06  9.105474e+05  7.517467e+05
10018        7.096548e+05  1.733554e+06  1.020976e+06  6.701920e+05
10019        5.436342e+05  9.278739e+05  2.891822e+07  4.638858e+05

Provider ID         10008         10011         10012         10016  \
Provider ID
10001        1.460299e+06  6.723581e+05  4.559090e+05  7.680036e+05
10005        5.777248e+05  2.004199e+06  3.265536e+06  1.429573e+06
10006        5.246954e+05  1.027439e+06  7.573411e+05  9.105474e+05
10007        3.638231e+06  6.394402e+05  4.694125e+05  7.517467e+05
10008                 inf  7.766998e+05  5.401081e+05  9.496953e+05
10011        7.766998e+05           inf  1.341775e+06  4.220911e+06
10012        5.401081e+05  1.341775e+06           inf  1.119063e+06
10016        9.496953e+05  4.220911e+06  1.119063e+06           inf
10018        8.236437e+05  1.242451e+07  1.226941e+06  5.866259e+06
10019        5.372119e+05  1.051748e+06  7.514774e+05  9.362341e+05

Provider ID         10018         10019
Provider ID
10001        7.096548e+05  5.436342e+05
10005        1.733554e+06  9.278739e+05
10006        1.020976e+06  2.891822e+07
10007        6.701920e+05  4.638858e+05
10008        8.236437e+05  5.372119e+05
10011        1.242451e+07  1.051748e+06
10012        1.226941e+06  7.514774e+05
10016        5.866259e+06  9.362341e+05
10018                 inf  1.048895e+06
10019        1.048895e+06           inf

[10 rows x 10 columns]
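The inf values on the diagonal, and distances of 1e5 to 1e7 km on a planet whose circumference is roughly 40,000 km, suggest that h2 departs from the haversine formula in the question: it computes 2 * 1/np.sin(np.sqrt(a)) where the question has 2 * asin(sqrt(a)), and it takes the cosine of latitudes that are still in degrees. What follows is a minimal sketch of an alternative that removes the Python loop entirely by using NumPy broadcasting. It assumes a dataframe indexed by Provider ID with float 'latitude' and 'longitude' columns in decimal degrees. The names `pairwise_haversine` and `EARTH_RADIUS_KM` are hypothetical, and only three of the providers are used to keep the example short.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6367  # same radius as the haversine function in the question

def pairwise_haversine(df):
    """Return a DataFrame of great-circle distances (km) between all rows of df.

    Assumes df has float 'latitude' and 'longitude' columns in decimal degrees.
    """
    lat = np.radians(df['latitude'].to_numpy(dtype=float))
    lon = np.radians(df['longitude'].to_numpy(dtype=float))
    # A column vector against a row vector broadcasts to every pair (i, j).
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    # Clipping guards against rounding that pushes a slightly outside [0, 1].
    km = EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return pd.DataFrame(km, index=df.index, columns=df.index)

# Usage with three of the providers from the question.
sample = pd.DataFrame(
    {'latitude': ['31.215379379000467', '34.22133455500045', '34.795039606000444'],
     'longitude': ['-85.36146587999968', '-86.15937514799964', '-87.68507485299966']},
    index=pd.Index(['10001', '10005', '10006'], name='Provider ID'),
).astype(float)
distances = pairwise_haversine(sample)
print(distances)
```

With this approach the diagonal is zero and the matrix is symmetric, which gives two quick sanity checks on any result. The intermediate arrays are n x n, so memory grows quadratically with the number of providers.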