Vectorizing a function in pandas

Views: 96 · Posted: 2017/3/26 4:33:32 · Tags: python, pandas, dataframe

Problem description

I have a dataframe that contains a list of lat/lon coordinates:

import pandas as pd

d = {'Provider ID': {0: '10001',
  1: '10005',
  2: '10006',
  3: '10007',
  4: '10008',
  5: '10011',
  6: '10012',
  7: '10016',
  8: '10018',
  9: '10019'},
 'latitude': {0: '31.215379379000467',
  1: '34.22133455500045',
  2: '34.795039606000444',
  3: '31.292159523000464',
  4: '31.69311635000048',
  5: '33.595265517000485',
  6: '34.44060759100046',
  7: '33.254429322000476',
  8: '33.50314015000049',
  9: '34.74643089500046'},
 'longitude': {0: ' -85.36146587999968',
  1: ' -86.15937514799964',
  2: ' -87.68507485299966',
  3: ' -86.25539902199966',
  4: ' -86.26549483099967',
  5: ' -86.66531866799966',
  6: ' -85.75726760699968',
  7: ' -86.81407933399964',
  8: ' -86.80242858299965',
  9: ' -87.69893502799965'}}
df = pd.DataFrame(d)
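Note that the coordinates in d are stored as strings, and the longitude strings carry a leading space. A minimal sketch of converting such columns to floats before doing any arithmetic follows; the two-row frame `raw` and the name `coords` are illustrative only, not part of the original question.

```python
import pandas as pd

# Two rows copied from the data above; `raw` and `coords` are hypothetical names.
raw = pd.DataFrame({
    "Provider ID": ["10001", "10005"],
    "latitude": ["31.215379379000467", "34.22133455500045"],
    "longitude": [" -85.36146587999968", " -86.15937514799964"],
})

# Index by provider, then strip whitespace and convert each column to float64.
coords = raw.set_index("Provider ID")
coords = coords.apply(lambda col: col.str.strip().astype(float))
```

Doing the conversion once up front avoids repeating `.astype('float64')` inside a loop.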

My goal is to use the haversine function to compute the distance, in km, between every pair of providers:

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km
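The function above relies on the `math` module, which works on one scalar at a time. As a rough sketch of the same formula written with NumPy ufuncs, so that it accepts either scalars or whole arrays, something like the following could be used. The name `haversine_np` and the `radius_km` parameter are assumptions made for illustration; the 6367 km radius is taken from the question.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2, radius_km=6367.0):
    """Great-circle distance in km; inputs are decimal degrees (scalars or arrays)."""
    # Convert every input to a float array in radians.
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(x, dtype=float)) for x in (lon1, lat1, lon2, lat2)
    )
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula, evaluated element-wise.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))
```

Because every operation is a ufunc, passing arrays of equal length returns one distance per pair of points without any Python-level loop.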

My goal is to get a dataframe that looks like the result_df below where the values are the distance between each provider id:

 result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])

I can do this in a loop, however it's terribly slow. I'm looking for some help in converting this to a vectorized method:

for first_hospital_coordinates in result_df.columns:
    # result_df has no 'Provider ID' column; the provider IDs are its index
    for second_hospital_coordinates in result_df.index:
        # .values[0] extracts a scalar, since math.radians expects a single float
        L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values[0]
        O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values[0]
        L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values[0]
        O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values[0]

        distance = haversine(O1, L1, O2, L2)

        # row = second provider, column = first provider
        result_df.loc[second_hospital_coordinates, first_hospital_coordinates] = distance

Solution

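To vectorize this code, operate on whole columns of the dataframe rather than on individual lats and longs. Here is my attempt. It uses a result dataframe, df2, and a new function, h2, that returns the distances from one provider p to every provider at once:

import numpy as np
import pandas as pd

def h2(df, p):
    # convert every coordinate from degrees to radians in one step
    inrad = np.radians(df)
    dlon = inrad.longitude - inrad.longitude[p]
    dlat = inrad.latitude - inrad.latitude[p]
    # haversine formula; both latitudes must be in radians
    a = np.sin(dlat/2)**2 + np.cos(inrad.latitude) * np.cos(inrad.latitude[p]) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index = df.index, columns = df.index)
for c in df2.columns:
    df2[c] = h2(df, c)

print(df2)

This should yield a 10 x 10 dataframe of distances, labelled by Provider ID on both axes. (I can't be sure I have the correct answer; my goal was to vectorize the code.) The output printed below came from an earlier version of h2 that computed c = 2 * 1/np.sin(np.sqrt(a)) and took cosines of latitudes in degrees, so its inf diagonal and very large values are not distances in km; with arcsin and radians the diagonal is 0.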

Provider ID         10001         10005         10006         10007  \
Provider ID                                                           
10001                 inf  5.021936e+05  5.270062e+05  1.649088e+06   
10005        5.021936e+05           inf  9.294868e+05  4.985233e+05   
10006        5.270062e+05  9.294868e+05           inf  4.548412e+05   
10007        1.649088e+06  4.985233e+05  4.548412e+05           inf   
10008        1.460299e+06  5.777248e+05  5.246954e+05  3.638231e+06   
10011        6.723581e+05  2.004199e+06  1.027439e+06  6.394402e+05   
10012        4.559090e+05  3.265536e+06  7.573411e+05  4.694125e+05   
10016        7.680036e+05  1.429573e+06  9.105474e+05  7.517467e+05   
10018        7.096548e+05  1.733554e+06  1.020976e+06  6.701920e+05   
10019        5.436342e+05  9.278739e+05  2.891822e+07  4.638858e+05   

Provider ID         10008         10011         10012         10016  \
Provider ID                                                           
10001        1.460299e+06  6.723581e+05  4.559090e+05  7.680036e+05   
10005        5.777248e+05  2.004199e+06  3.265536e+06  1.429573e+06   
10006        5.246954e+05  1.027439e+06  7.573411e+05  9.105474e+05   
10007        3.638231e+06  6.394402e+05  4.694125e+05  7.517467e+05   
10008                 inf  7.766998e+05  5.401081e+05  9.496953e+05   
10011        7.766998e+05           inf  1.341775e+06  4.220911e+06   
10012        5.401081e+05  1.341775e+06           inf  1.119063e+06   
10016        9.496953e+05  4.220911e+06  1.119063e+06           inf   
10018        8.236437e+05  1.242451e+07  1.226941e+06  5.866259e+06   
10019        5.372119e+05  1.051748e+06  7.514774e+05  9.362341e+05   

Provider ID         10018         10019  
Provider ID                              
10001        7.096548e+05  5.436342e+05  
10005        1.733554e+06  9.278739e+05  
10006        1.020976e+06  2.891822e+07  
10007        6.701920e+05  4.638858e+05  
10008        8.236437e+05  5.372119e+05  
10011        1.242451e+07  1.051748e+06  
10012        1.226941e+06  7.514774e+05  
10016        5.866259e+06  9.362341e+05  
10018                 inf  1.048895e+06  
10019        1.048895e+06           inf  

[10 rows x 10 columns]
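The answer above still loops over the columns, calling h2 once per provider. As a possible next step, the whole matrix can be computed in one pass with NumPy broadcasting. The sketch below is not the answerer's method; the function name `pairwise_haversine`, the `radius_km` parameter, and the three-point `sample` frame (two points on the equator and the north pole) are all made up for illustration.

```python
import numpy as np
import pandas as pd

def pairwise_haversine(df, radius_km=6367.0):
    """Return a square DataFrame of great-circle distances (km) between all rows."""
    lat = np.radians(df["latitude"].astype(float).to_numpy())
    lon = np.radians(df["longitude"].astype(float).to_numpy())
    # Broadcasting a column vector against a row vector gives all pairwise differences.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    km = 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return pd.DataFrame(km, index=df.index, columns=df.index)

# Hypothetical sample: A and B on the equator 90 degrees apart, C at the north pole.
sample = pd.DataFrame(
    {"latitude": ["0", "0", "90"], "longitude": ["0", "90", "0"]},
    index=pd.Index(["A", "B", "C"], name="Provider ID"),
)
dist = pairwise_haversine(sample)
```

For the question's data, this would be called on the dataframe after `set_index('Provider ID')`. The intermediate arrays are n x n, so memory grows quadratically with the number of providers.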
