How to efficiently compare rows in a pandas DataFrame?


Question

I have a pandas dataframe containing a record of lightning strikes with timestamps and global positions in the following format:

    Index  Date      Time              Lat       Lon       Good fix?
0   1      20160101  00:00:00.9962692  -7.1961   -60.7604  1
1   2      20160101  00:00:01.0646207  -7.0518   -60.6911  1
2   3      20160101  00:00:01.1102066  -25.3913  -57.2922  1
3   4      20160101  00:00:01.2018573  -7.4842   -60.5129  1
4   5      20160101  00:00:01.2942750  -7.3939   -60.4992  1
5   6      20160101  00:00:01.4431493  -9.6386   -62.8448  1
6   8      20160101  00:00:01.5226157  -23.7089  -58.8888  1
7   9      20160101  00:00:01.5932412  -6.3513   -55.6545  1
8   10     20160101  00:00:01.6736350  -23.8019  -58.9382  1
9   11     20160101  00:00:01.6957858  -24.5724  -57.7229  1

The actual dataframe contains millions of rows. I wish to separate out events that happened far away in space and time from other events, and store them in a new dataframe, isolated_fixes. I have written code to calculate the separation of any two events, as follows:

import math as m

def are_strikes_space_close(strike1, strike2, defclose=100, latpos=3, lonpos=4):
    # Uses the haversine formula to calculate the distance between two points.
    # Each strike is an (index, row) tuple as yielded by DataFrame.iterrows().
    # Returns a tuple: (Boolean closeness statement, distance in km).
    radlat1 = m.radians(strike1[1].iloc[latpos])
    radlon1 = m.radians(strike1[1].iloc[lonpos])
    radlat2 = m.radians(strike2[1].iloc[latpos])
    radlon2 = m.radians(strike2[1].iloc[lonpos])

    a = (m.sin((radlat1 - radlat2) / 2) ** 2) + m.cos(radlat1) * m.cos(radlat2) * (m.sin((radlon1 - radlon2) / 2) ** 2)
    c = 2 * m.atan2(a ** 0.5, (1 - a) ** 0.5)
    R = 6371  # earth radius in km
    d = R * c  # distance between points in km
    return (d <= defclose, d)

And for time:

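import datetime as dt

def getdatetime(series, timelabel=2, datelabel=1, timeformat="%H:%M:%S.%f", dateformat="%Y%m%d"):
    # series is an (index, row) tuple as yielded by DataFrame.iterrows().
    # The time string is cut to 15 characters because %f accepts at most six fractional digits.
    time = dt.datetime.strptime(series[1].iloc[timelabel][:15], timeformat)
    date = dt.datetime.strptime(str(series[1].iloc[datelabel]), dateformat)
    return dt.datetime.combine(date.date(), time.time())


def are_strikes_time_close(strike1, strike2, defclose=dt.timedelta(0, 7200, 0)):
    # Returns a tuple: (Boolean closeness statement, time difference).
    # The default tolerance is 7200 seconds, i.e. 2 hours.
    timediff = abs(getdatetime(strike1) - getdatetime(strike2))
    return (timediff <= defclose, timediff)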

The real problem is how to efficiently compare all events to all other events to determine how many of them are space_close and time_close.

Note that not all events need to be checked, as they are ordered with respect to datetime, so if there was a way to check events 'middle out' and then stop when events were no longer close in time, that would save a lot of operations, but I don't know how to do this.

At the moment, my (nonfunctional) attempt looks like this:

def extrisolfixes(data,filtereddata,defisol=4): 
    for strike1 in data.iterrows():
        near_strikes=-1 #-1 to account for self counting once on each loop
        for strike2 in data.iterrows():
            if are_strikes_space_close(strike1,strike2)[0]==True and are_strikes_time_close(strike1,strike2)[0]==True:
                near_strikes+=1
        if near_strikes<=defisol:
            filtereddata=filtereddata.append(strike1)

Thanks for any help! Am happy to provide clarification if needed.

Recommended answer

This answer might not be very efficient. I'm facing a very similar problem and am currently looking for something more efficient than what I do because it still takes one hour to compute on my dataframe (600k rows).

I first suggest you don't even think about using for loops like you do. You might not be able to avoid one (which is what I do using apply), but the second can (must) be vectorized.

The idea of this technique is to create a new column in the dataframe storing whether there is another strike nearby (temporally and spatially).

First let's create a function calculating (with numpy package) the distances between one strike (reference) and all the others:

import numpy as np

def get_distance(reference, other_strikes):
    # reference: [lat, lon] of one strike, in degrees
    # other_strikes: DataFrame with 'Lat' and 'Lon' columns, in degrees
    radius = 6371.00085  # radius of the earth in km
    # Get lats and longs in radians, then compute deltas:
    lat1 = np.radians(other_strikes.Lat)
    lat2 = np.radians(reference[0])
    dLat = lat2 - lat1
    dLon = np.radians(reference[1]) - np.radians(other_strikes.Lon)
    # And compute the distance (in km)
    a = np.sin(dLat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon / 2.0) ** 2
    return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius
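
To make the vectorization concrete, here is a minimal standalone sketch of the same haversine computation applied to the first three strikes of the sample table in the question. The helper name `haversine_km` and the variable names are illustrative assumptions, not part of the original answer; the 100 km threshold is the default from the question's code.

```python
import numpy as np

def haversine_km(lat_ref, lon_ref, lats, lons, radius=6371.00085):
    """Great-circle distance in km from one point to arrays of points (inputs in degrees)."""
    lat1 = np.radians(np.asarray(lats, dtype=float))
    lat2 = np.radians(lat_ref)
    dlat = lat2 - lat1
    dlon = np.radians(lon_ref) - np.radians(np.asarray(lons, dtype=float))
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2 * radius * np.arcsin(np.minimum(1, np.sqrt(a)))

# First three strikes from the sample table in the question.
lats = np.array([-7.1961, -7.0518, -25.3913])
lons = np.array([-60.7604, -60.6911, -57.2922])

# Distances from the first strike to all three strikes, in a single call.
distances = haversine_km(lats[0], lons[0], lats, lons)

# Boolean mask of strikes within 100 km (the first entry is the strike itself).
close = distances <= 100
```

One call replaces the whole inner loop: the arithmetic runs over the full array instead of one pair of rows at a time.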

Then create a function that checks whether, for one given strike, there is at least one other strike nearby:

def is_there_a_strike_nearby(date_ref, lat_ref, long_ref, delta_t, delta_d, other_strikes):
    # delta_t: time tolerance in whole days; delta_d: distance tolerance in km.
    # other_strikes must be sorted by its datetime column 'Date'.
    dmin = date_ref - np.timedelta64(delta_t, 'D')
    dmax = date_ref + np.timedelta64(delta_t, 'D')

    # Let's first find all strikes within the temporal range
    ind = other_strikes.Date.searchsorted([dmin, dmax])
    nearby_strikes = other_strikes.iloc[ind[0]:ind[1]].copy()

    if len(nearby_strikes) == 0:
        return False

    # Let's compute spatial distance now:
    nearby_strikes['distance'] = get_distance([lat_ref, long_ref], nearby_strikes[['Lat', 'Lon']])

    nearby_strikes = nearby_strikes[nearby_strikes['distance'] <= delta_d]

    # The reference strike is itself a row of other_strikes (distance 0),
    # so more than one match is needed for another strike to be nearby.
    return len(nearby_strikes) > 1
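
The time-window lookup does not have to be repeated for each strike: since the strikes are sorted by time, the window bounds for every strike can be found in one pass. The following is only a sketch under assumptions, namely a sorted datetime64 array and a 2-hour tolerance as in the question; the names `times`, `window`, `lo`, `hi` and `n_in_window` are illustrative and not from the original post.

```python
import numpy as np

# Hypothetical, already sorted strike times.
times = np.array([
    "2016-01-01T00:00:00",
    "2016-01-01T00:30:00",
    "2016-01-01T01:45:00",
    "2016-01-01T05:00:00",
    "2016-01-01T09:00:00",
], dtype="datetime64[s]")

window = np.timedelta64(2, "h")  # assumed time tolerance

# For every strike i, rows lo[i]:hi[i] lie within +/- window of times[i].
lo = np.searchsorted(times, times - window, side="left")
hi = np.searchsorted(times, times + window, side="right")

# Number of other strikes in each window (the strike itself is excluded).
n_in_window = hi - lo - 1
```

This is the 'middle out' idea from the question: the distance only has to be computed for rows lo[i]:hi[i], and strikes with an empty window are isolated without any distance computation.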

Now that all your functions are ready, you can use apply on your dataframe:

data['presence of nearby strike'] = data[['Date','Lat','Lon']].apply(
    lambda x: is_there_a_strike_nearby(x['Date'], x['Lat'], x['Lon'], delta_t, delta_d, data),
    axis=1)

And that's it: you have now created a new column in your dataframe that indicates whether each strike is isolated (False) or not (True). Creating your new dataframe from this is easy.
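
For completeness, a minimal sketch of that last step with a boolean mask. The small dataframe stands in for the result of the apply call, and its flag values are made up for illustration.

```python
import pandas as pd

# Hypothetical result of the apply step: one Boolean flag per strike.
data = pd.DataFrame({
    "Lat": [-7.1961, -7.0518, -25.3913],
    "Lon": [-60.7604, -60.6911, -57.2922],
    "presence of nearby strike": [True, True, False],
})

# Keep only the rows whose flag is False, i.e. the isolated strikes.
isolated_fixes = data[~data["presence of nearby strike"]].copy()
```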

The problem with this method is that it still takes a long time to run. There are ways to make it faster: for instance, change is_there_a_strike_nearby to also take your data sorted by Lat and by Lon as arguments, and use further searchsorted calls to filter on Lat and Lon before computing the distance (for instance, if you want the strikes within a range of 10 km, you can filter with a delta_Lat of 0.09).
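
A rough sketch of the latitude pre-filter, under stated assumptions: one degree of latitude is taken as about 111.19 km, the latitudes are made up from the sample table, and the helper `candidates_by_latitude` is a hypothetical name. Only latitude is filtered here, because the width of a longitude band depends on the latitude.

```python
import numpy as np

KM_PER_DEG_LAT = 111.19  # approximate length of one degree of latitude

# Hypothetical strike latitudes in degrees, in arbitrary order.
lats = np.array([-7.1961, -7.0518, -25.3913, -7.4842, -23.7089])

delta_d = 20.0                        # search radius in km (assumed)
delta_lat = delta_d / KM_PER_DEG_LAT  # 10 km would give about 0.09 degrees

order = np.argsort(lats)   # original positions, ordered by latitude
sorted_lats = lats[order]

def candidates_by_latitude(lat_ref):
    """Return the original positions of strikes within delta_lat of lat_ref."""
    lo = np.searchsorted(sorted_lats, lat_ref - delta_lat, side="left")
    hi = np.searchsorted(sorted_lats, lat_ref + delta_lat, side="right")
    return np.sort(order[lo:hi])

# Candidates around the first strike (the strike itself is included).
near_first = candidates_by_latitude(lats[0])
```

The pre-filter only bounds the latitude difference, so the exact distance still has to be computed on the remaining candidates.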

Any feedback on this method is more than welcome!
