快速(矢量化)的方式来找到属于同样大小的矩形(由两点给出)的一个DF中的点 [英] Fast (vectorized) way to find points in one DF belonging to equally sized rectangles (given by two points) from the second DF

查看:173
本文介绍了快速(矢量化)的方式来找到属于同样大小的矩形(由两点给出)的一个DF中的点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有数据框A,如下所示:

  type latw lngs late lngn 
0 1000 45.457966 9.174864 45.458030 9.174907
1 1000 45.457966 9.174864 45.458030 9.174907
2 1000 45.458030 9.174864 45.458094 9.174907
3 1000 45.458094 9.174864 45.458157 9.174907
4 1000 45.458157 9.174864 45.458221 9.174907
5 1000 45.458221 9.174864 45.458285 9.174907
6 1000 45.458285 9.174864 45.458349 9.174907
7 1000 45.458349 9.174864 45.458413 9.174907
8 1000 45.458413 9.174864 45.458477 9.174907
9 1000 45.458477 9.174864 45.458540 9.174907
10 1000 45.458540 9.174864 45.458604 9.174907
11 1000 45.458604 9.174864 45.458668 9.174907
12 1000 45.458668 9.174864 45.458732 9.174907
13 1000 45.458732 9.174864 45.458796 9.174907
14 1000 45.4 58796 9.174864 45.458860 9.174907
15 1000 45.458860 9.174864 45.458923 9.174907
16 1000 45.458923 9.174864 45.458987 9.174907
17 1000 45.458987 9.174864 45.459051 9.174907
18 1000 45.459051 9.174864 45.459115 9.174907
19 1000 45.459115 9.174864 45.459179 9.174907
20 1000 45.459179 9.174864 45.459243 9.174907
21 1000 45.459243 9.174864 45.459306 9.174907
22 1000 45.459306 9.174864 45.459370 9.174907
23 1000 45.459370 9.174864 45.459434 9.174907
24 1000 45.459434 9.174864 45.459498 9.174907
25 1000 45.459498 9.174864 45.459562 9.174907
26 1000 45.459562 9.174864 45.459626 9.174907
27 1000 45.459626 9.174864 45.459689 9.174907
28 1000 45.459689 9.174864 45.459753 9.174907
29 1000 45.459753 9.174864 45.459817 9.174907
... ... ... ... ... ...
970 1000 45.460583 9.175545 45.460647 9.175587
971 1000 45.460647 9.175545 45.460711 9.175587
972 1000 45.460711 9.175545 45.460775 9.175587
973 1000 45.460775 9.175545 45.460838 9.175587
974 1000 45.460838 9.175545 45.460902 9.175587
975 1000 45.460902 9.175545 45.460966 9.175587
976 1000 45.460966 9.175545 45.461030 9.175587
977 1000 45.461030 9.175545 45.461094 9.175587
978 1000 45.461094 9.175545 45.461157 9.175587
979 1000 45.461157 9.175545 45.461221 9.175587
980 1000 45.461221 9.175545 45.461285 9.175587
981 1000 45.461285 9.175545 45.461349 9.175587
982 1000 45.461349 9.175545 45.461413 9.175587
983 1000 45.461413 9.175545 45.461477 9.175587
984 1000 45.461477 9.175545 45.461540 9.175587
985 1000 45.461540 9.175545 45.461604 9.175587
986 1000 45.461604 9.175545 45.461668 9.175587
987 1000 45.457966 9.175587 45.458030 9.175630
988 1000 45.458030 9.175587 45.458094 9.175630
989 1000 45.458094 9.175587 45.458157 9.175630
990 1000 45.458157 9.175587 45.458221 9.175630
991 1000 45.458221 9.175587 45.458285 9.175630
992 1000 45.458285 9.175587 45.458349 9.175630
993 1000 45.458349 9.175587 45.458413 9.175630
994 1000 45.458413 9.175587 45.458477 9.175630
995 1000 45.458477 9.175587 45.458540 9.175630
996 1000 45.458540 9.175587 45.458604 9.175630
997 1000 45.458604 9.175587 45.458668 9.175630
998 1000 45.458668 9.175587 45.458732 9.175630
999 1000 45.458732 9.175587 45.458796 9.175630

它有22,000,000行×5列,数据框B如下所示:

 类型Lat Lng 
0 0 45.465739 9.180830
1 2 45.463950 9.187113
2 1 45.468015 9.180648
3 1 45.462209 9.187447
4 0 45.459578 9.184007
5 1 45.459822 9.187034
6 2 45.454988 9.180310
7 2 45.459818 9.189377
8 0 45.462200 9.187440
9 0 45.467160 9.180100
10 2 45.459407 9.183300
11 2 45.457699 9.187434
12 1 45.455319 9.186697
13 0 45.461138 9.191943
14 2 45.456397 9.189028
15 0 45.457062 9.185878
16 1 45.461980 9.187024
17 1 45.464319 9.183142
18 2 45.464227 9.187065
19 1 45.460886 9.185216

它有2,000,000行×3列。
我想用B替换数据框A的类型的值,其中:

  A [latw] < B [lat]< A [late]和A [lngs]< B [lng]< B [lngn] 

我想检查一个B中的位置属于A中的哪个矩形。



PS我正在寻找最快的方式在python例如使用并行处理。

解决方案

高级说明



这个答案是很相关的。以下是高级别点




  • 使用



    但是,为了交换类型值,我在进程中添加了一个标志函数重新分配。只需运行进程(A,B,reassign_type = True)



    >

    功能

    我定义了返回相同内容的功能,并使其成为苹果比较

      def pir(A,B):
    return A.values [process(A,B) ,0],0]

    def maxu(A,B):
    return B.apply(
    lambda x:A.loc [
    (A.latw < x.Lat)&
    (x.Lat< A.late)&
    (A.lngs< x.Lng)&
    (x.Lng& A.lngn),'type'
    ] .head(1),
    axis = 1
    ).values.diagonal()
    pre>



    更多彻底测试

      from timeit import timeit 
    from itertools import product

    rng = np.arange(20,51,5)
    idx = pd.MultiIndex.from_arrays([rng,rng],names = ['B','A '])
    functions = pd.Index(['pir','maxu'],name ='method')

    results = pd.DataFrame(index = idx,columns = functions )

    for bi,ai in idx:
    n = ai
    ws = np.random.rand(n,2)* 10 + np.array([[40,30 ]])
    A = pd.DataFrame(
    np.hstack([ - np.ones((n,1)),ws,ws + np.random.rand(n,2)* 2 ]),
    columns = ['type','latw','lngs','late','lngn']

    m = int(bi ** .5)
    prng = np.arange(m)/ m * 10 + .5
    B = pd.DataFrame(
    np.array(list(product(prng,prng)))+ np.array [[40,30]]),
    columns = ['Lat',Lng])
    B.insert(0,'type',np.random.randint(3,size = len (B)))
    在函数中的j:
    results.set_value(
    (bi,ai),j,
    timeit(
    ' {}(A,B)'。格式(j),
    'from __main__ import {},A,B'.format(j),
    number = 10



    results.plot(rot = 45)






    代码



      from collections import namedtuple 
    TwoSided = namedtuple('TwoSided',['match','idx','found'])

    def two_sided_search (左,右,a):
    #sort`a`并保存订单以获取
    argsort = a.argsort()
    asorted = a [argsort]

    #在`asorted`中`````````````````````````````````````````````````````` ,side ='right')

    rng = np.arange(left.size)

    #a match h当我们发现`left`的时候,会发现`````````````以后发现
    match_idx = rng [s_left< s_right]

    #忽略没有匹配的位置
    left_pos = s_left [match_idx]
    right_pos = s_right [match_idx]
    #从哪里找到的距离到哪里被发现
    #这表示一个被左右包裹的项目
    diff_pos = right_pos - left_pos
    #在`match_idx`上建立一个索引,与
    中的所有职位配对#`asorted`。
    d2 = left_pos - np.append(0,diff_pos [: - 1] .cumsum())
    p1 = np.arange(len(left_pos))。repeat(diff_pos)
    p2 = np.arange(diff_pos.sum())+ d2.repeat(diff_pos)

    #返回
    #`match_idx`,它是左或右的整数片
    #对于match_idx中的每个元素,
    #a中有一个或多个元素被left和right
    #p1包围与`argsort [p2]`
    #配对的`match_idx`例如考虑`p1 = [1,1,2,2]`和`argsort [p2] = [3,5,12,5 ]`
    #这意味着`left [match_idx [1]]< a [3] right [match_idx [1]]`
    #和`left [match_idx [1]]< a [5] right [match_idx [1]]`
    #和`left [match_idx [2]]< a [12] right [match_idx [2]]`
    #和`left [match_idx [2]]< a [5]

    返回TwoSided(match_idx,p1,argsort [p2])

    def intersectNd(a,b):
    #从http://stackoverflow.com/a/8317403/2336654
    #返回2个二维数组的整数
    #的策略是转换为结构化数组,然后
    #使用np .intersect1d然后转换回非结构化
    ncols = a.shape [1]
    dtype = dict(
    names = ['f {}'。format(i)for i in range(ncols )],
    formats = ncols * [a.dtype]


    返回np.intersect1d(
    a.view(dtype),b.view(dtype )
    ).view(a.dtype).reshape(-1,ncols)

    def where_in(b1,b2):
    #返回一个lenght len(b2 )其中True
    #当b2的元素在b1
    bsort = b1.argsort()
    bsrtd = b1 [bsort]
    bsrch = bsrtd.searchsorted(b2)%bsrtd .size
    return bsrtd [bsrch] == b2

    def process(A,B,reassign_type = False):
    lats = two_sided_se arch(
    A.latw.values,
    A.late.values,
    B.Lat.values,


    #子集A& B
    #无点寻找经度匹配
    #在没有纬度匹配的行
    lngs = A.lngs.values [lats.match]
    lngn = A.lngn.values [lats.match]
    u,i = np.unique(lats.found,return_index = 1)
    lng = B.Lng.values [u]

    lons = two_sided_search
    lngs,lngn,lng


    #`lats.match`是我们找到成功纬度的地方
    #`lons.match`是我们找到成功经度的地方
    #由于`lons.match`基于`lats.match`,我们可以切
    #来获得满足
    #的原始位置'$ b#the latw< blat迟到lngs< blng< lngn
    match = lats.match [lons.match]

    #然而,对于A中可能与B匹配的每一行,
    #可能有多个B的

    #我们通过查找
    #来开始解决这个问题,在这里我们已经匹配了
    #这里给出了一个布尔数列的lats.match
    #index对于满足条件的B
    #中每行重复的值
    wi = where_in(lons.match,lats.idx)

    #a(?x 2)与
    lon_pos = np.hstack([
    lats.match [lons.match [lons.idx],None])匹配的A中
    #行的组合,
    u [lons
    ])

    lat_pos = np.hstack([
    lats.match [lats.idx [wi],None],
    lats。发现[wi,无]
    ])

    x = intersectNd(lat_pos,lon_pos)
    p,pi = np.unique(x [:, 0],return_index = 1)

    如果reassign_type:
    A.iloc [x [p i,0],A.columns.get_loc('type')] = \
    B.iloc [x [pi,1],B.columns.get_loc('type')]值
    else:
    return x pi


    I have data frame "A" that looks like this:

    type    latw    lngs    late    lngn
    0   1000    45.457966   9.174864    45.458030   9.174907
    1   1000    45.457966   9.174864    45.458030   9.174907
    2   1000    45.458030   9.174864    45.458094   9.174907
    3   1000    45.458094   9.174864    45.458157   9.174907
    4   1000    45.458157   9.174864    45.458221   9.174907
    5   1000    45.458221   9.174864    45.458285   9.174907
    6   1000    45.458285   9.174864    45.458349   9.174907
    7   1000    45.458349   9.174864    45.458413   9.174907
    8   1000    45.458413   9.174864    45.458477   9.174907
    9   1000    45.458477   9.174864    45.458540   9.174907
    10  1000    45.458540   9.174864    45.458604   9.174907
    11  1000    45.458604   9.174864    45.458668   9.174907
    12  1000    45.458668   9.174864    45.458732   9.174907
    13  1000    45.458732   9.174864    45.458796   9.174907
    14  1000    45.458796   9.174864    45.458860   9.174907
    15  1000    45.458860   9.174864    45.458923   9.174907
    16  1000    45.458923   9.174864    45.458987   9.174907
    17  1000    45.458987   9.174864    45.459051   9.174907
    18  1000    45.459051   9.174864    45.459115   9.174907
    19  1000    45.459115   9.174864    45.459179   9.174907
    20  1000    45.459179   9.174864    45.459243   9.174907
    21  1000    45.459243   9.174864    45.459306   9.174907
    22  1000    45.459306   9.174864    45.459370   9.174907
    23  1000    45.459370   9.174864    45.459434   9.174907
    24  1000    45.459434   9.174864    45.459498   9.174907
    25  1000    45.459498   9.174864    45.459562   9.174907
    26  1000    45.459562   9.174864    45.459626   9.174907
    27  1000    45.459626   9.174864    45.459689   9.174907
    28  1000    45.459689   9.174864    45.459753   9.174907
    29  1000    45.459753   9.174864    45.459817   9.174907
    ... ... ... ... ... ...
    970 1000    45.460583   9.175545    45.460647   9.175587
    971 1000    45.460647   9.175545    45.460711   9.175587
    972 1000    45.460711   9.175545    45.460775   9.175587
    973 1000    45.460775   9.175545    45.460838   9.175587
    974 1000    45.460838   9.175545    45.460902   9.175587
    975 1000    45.460902   9.175545    45.460966   9.175587
    976 1000    45.460966   9.175545    45.461030   9.175587
    977 1000    45.461030   9.175545    45.461094   9.175587
    978 1000    45.461094   9.175545    45.461157   9.175587
    979 1000    45.461157   9.175545    45.461221   9.175587
    980 1000    45.461221   9.175545    45.461285   9.175587
    981 1000    45.461285   9.175545    45.461349   9.175587
    982 1000    45.461349   9.175545    45.461413   9.175587
    983 1000    45.461413   9.175545    45.461477   9.175587
    984 1000    45.461477   9.175545    45.461540   9.175587
    985 1000    45.461540   9.175545    45.461604   9.175587
    986 1000    45.461604   9.175545    45.461668   9.175587
    987 1000    45.457966   9.175587    45.458030   9.175630
    988 1000    45.458030   9.175587    45.458094   9.175630
    989 1000    45.458094   9.175587    45.458157   9.175630
    990 1000    45.458157   9.175587    45.458221   9.175630
    991 1000    45.458221   9.175587    45.458285   9.175630
    992 1000    45.458285   9.175587    45.458349   9.175630
    993 1000    45.458349   9.175587    45.458413   9.175630
    994 1000    45.458413   9.175587    45.458477   9.175630
    995 1000    45.458477   9.175587    45.458540   9.175630
    996 1000    45.458540   9.175587    45.458604   9.175630
    997 1000    45.458604   9.175587    45.458668   9.175630
    998 1000    45.458668   9.175587    45.458732   9.175630
    999 1000    45.458732   9.175587    45.458796   9.175630
    

    It has 22,000,000 rows × 5 columns and there is data frame "B" which looks like this:

        type        Lat       Lng
    0      0  45.465739  9.180830
    1      2  45.463950  9.187113
    2      1  45.468015  9.180648
    3      1  45.462209  9.187447
    4      0  45.459578  9.184007
    5      1  45.459822  9.187034
    6      2  45.454988  9.180310
    7      2  45.459818  9.189377
    8      0  45.462200  9.187440
    9      0  45.467160  9.180100
    10     2  45.459407  9.183300
    11     2  45.457699  9.187434
    12     1  45.455319  9.186697
    13     0  45.461138  9.191943
    14     2  45.456397  9.189028
    15     0  45.457062  9.185878
    16     1  45.461980  9.187024
    17     1  45.464319  9.183142
    18     2  45.464227  9.187065
    19     1  45.460886  9.185216
    

    It has 2,000,000 rows × 3 columns. I want to replace type's value of data frame "A" with "B" Where:

    A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn]
    

    I want to check a location from B belongs to which one of the rectangles in A.

    PS I'm looking for the fastest way in python such as using parallel processing.

    解决方案

    High Level Explanation

    This answer is quite involved. The following are high-level points

    • Use np.searchsorted with the west latitudes in A relative to the sorted latitudes in B. Repeat this for east latitudes but with a parameter of side='right'. This tells me where each west/east latitudes fall in the spectrum defined in B. np.searchsorted are index values. So if the east based results are greater than the west based results, that implies that there was a latitude from B between west and east.
      • Track where east searches are greater than west searches.
      • Find the difference. Not only do I care if the east search is greater than the west search, the difference tells me how many B.Lat's are in between. I expand arrays to track all possible matches.
    • Repeat this process for south and north longitudes but only over the subset in which we found matching latitudes. No need to waste effort looking where we know we don't have a match.
      • Perform tricky slicing to unwind positions
    • Use numpy structured arrays to find intersections of 2-D arrays
    • This final intersection has ordinal positions from A and matching ordinal positions from B.
      • remember I said we could possible match more than one row from B, I take the first.

    Demonstration

    use A and B provided by @MaxU

    p = process(A, B)
    print(p)
    
    [[3 0]
     [5 1]
     [9 2]]
    

    To show the results and compare. I'll put them side by side

    pd.concat([
            A.iloc[p[:, 0]].reset_index(drop=True),
            B.iloc[p[:, 1]].reset_index(drop=True)
        ], axis=1, keys=['A', 'B'])
    

    However, to swap the type values, I put a flag in the process function to reassign. Just run process(A, B, reassign_type=True)

    timing
    functions
    I defined the functions to return the same thing and make it an apples to apples comparison

    def pir(A, B):
        return A.values[process(A, B)[:, 0], 0]
    
    def maxu(A, B):
        return B.apply(
            lambda x: A.loc[
                (A.latw < x.Lat) &
                (x.Lat < A.late) &
                (A.lngs < x.Lng) &
                (x.Lng < A.lngn), 'type'
            ].head(1),
            axis=1
        ).values.diagonal()
    

    more thorough testing

    from timeit import timeit
    from itertools import product
    
    rng = np.arange(20, 51, 5)
    idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])
    functions = pd.Index(['pir', 'maxu'], name='method')
    
    results = pd.DataFrame(index=idx, columns=functions)
    
    for bi, ai in idx:
        n = ai
        ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])
        A = pd.DataFrame(
            np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),
            columns=['type', 'latw', 'lngs', 'late', 'lngn']
        )
        m = int(bi ** .5)
        prng = np.arange(m) / m * 10 + .5
        B = pd.DataFrame(
            np.array(list(product(prng, prng))) + np.array([[40, 30]]),
            columns=['Lat', "Lng"])
        B.insert(0, 'type', np.random.randint(3, size=len(B)))
        for j in functions:
            results.set_value(
                (bi, ai), j,
                timeit(
                    '{}(A, B)'.format(j),
                    'from __main__ import {}, A, B'.format(j),
                    number=10
                )
            )
    
    results.plot(rot=45)
    


    Code

    from collections import namedtuple
    TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])
    
    def two_sided_search(left, right, a):
        # sort `a` and save order for later
        argsort = a.argsort()
        asorted = a[argsort]
    
        # where does `left` fall in `asorted`
        s_left = asorted.searchsorted(left)
        # where does `right` fall in `asorted`
        s_right = asorted.searchsorted(right, side='right')
    
        rng = np.arange(left.size)
    
        # a match happens when where we found `left`, `right` was found later
        match_idx = rng[s_left < s_right]
    
        # ignore positions with no match
        left_pos = s_left[match_idx]
        right_pos = s_right[match_idx]
        # distance from where left was found to where right was found
        # this represents how many items in a are wrapped by left and right
        diff_pos = right_pos - left_pos
        # build an index on `match_idx` paired with all positions in
        # `asorted`.
        d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())
        p1 = np.arange(len(left_pos)).repeat(diff_pos)
        p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)
    
        # returns
        # `match_idx` which is an integer slice of either `left` or `right`
        # for each element in `match_idx` there are one or more elements in
        # `a` that were surrounded by `left` and `right`
        # `p1` are the positions in `match_idx` that are paired with `argsort[p2]`
        # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`
        # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`
        # and `left[match_idx[1]] < a[5] < right[match_idx[1]]`
        # and `left[match_idx[2]] < a[12] < right[match_idx[2]]`
        # and `left[match_idx[2]] < a[5] < right[match_idx[2]]`
    
        return TwoSided(match_idx, p1, argsort[p2])
    
    def intersectNd(a, b):
        # taken from http://stackoverflow.com/a/8317403/2336654
        # returns the interesection of 2 2-D arrays
        # the strategy is to convert to a structured array then
        # use np.intersect1d then convert back to unstructured
        ncols = a.shape[1]
        dtype = dict(
            names=['f{}'.format(i) for i in range(ncols)],
            formats=ncols * [a.dtype]
        )
    
        return np.intersect1d(
            a.view(dtype), b.view(dtype)
        ).view(a.dtype).reshape(-1, ncols)
    
    def where_in(b1, b2):
        # return a mask of lenght len(b2) where True
        # when element of b2 is in b1
        bsort = b1.argsort()
        bsrtd = b1[bsort]
        bsrch = bsrtd.searchsorted(b2) % bsrtd.size
        return bsrtd[bsrch] == b2
    
    def process(A, B, reassign_type=False):
        lats = two_sided_search(
            A.latw.values,
            A.late.values,
            B.Lat.values,
        )
    
        # subset A & B
        # no point looking for longitude matches
        # in rows without a latitude match
        lngs = A.lngs.values[lats.match]
        lngn = A.lngn.values[lats.match]
        u, i = np.unique(lats.found, return_index=1)
        lng = B.Lng.values[u]
    
        lons = two_sided_search(
            lngs, lngn, lng
        )
    
        # `lats.match` is where we found successful latitudes
        # `lons.match` is where we found successful longitudes
        # since `lons.match` was based on `lats.match` we can slice
        # to get the original positions of `A` that satisfy both
        # the latw < blat < late & lngs < blng < lngn
        match = lats.match[lons.match]
    
        # however, for every row in A that might be a match for B,
        # there may be multilple B's
        #
        # We start addressing this by finding
        # where in lats.idx do we have matches
        # this gives me a boolean array of lats.match
        # index values repeated for each row in B
        # that satisfies our conditions
        wi = where_in(lons.match, lats.idx)
    
        # a (? x 2) array representing all combinations of
        # rows in A that matched the 
        lon_pos = np.hstack([
                lats.match[lons.match[lons.idx], None],
                u[lons.found, None]
            ])
    
        lat_pos = np.hstack([
                lats.match[lats.idx[wi], None],
                lats.found[wi, None]
            ])
    
        x = intersectNd(lat_pos, lon_pos)
        p, pi = np.unique(x[:, 0], return_index=1)
    
        if reassign_type:
            A.iloc[x[pi, 0], A.columns.get_loc('type')] = \
                B.iloc[x[pi, 1], B.columns.get_loc('type')].values
        else:
            return x[pi]
    

    这篇关于快速(矢量化)的方式来找到属于同样大小的矩形(由两点给出)的一个DF中的点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆