快速(矢量化)的方式来找到属于同样大小的矩形(由两点给出)的一个DF中的点 [英] Fast (vectorized) way to find points in one DF belonging to equally sized rectangles (given by two points) from the second DF
问题描述
我有数据框A,如下所示:
type latw lngs late lngn
0 1000 45.457966 9.174864 45.458030 9.174907
1 1000 45.457966 9.174864 45.458030 9.174907
2 1000 45.458030 9.174864 45.458094 9.174907
3 1000 45.458094 9.174864 45.458157 9.174907
4 1000 45.458157 9.174864 45.458221 9.174907
5 1000 45.458221 9.174864 45.458285 9.174907
6 1000 45.458285 9.174864 45.458349 9.174907
7 1000 45.458349 9.174864 45.458413 9.174907
8 1000 45.458413 9.174864 45.458477 9.174907
9 1000 45.458477 9.174864 45.458540 9.174907
10 1000 45.458540 9.174864 45.458604 9.174907
11 1000 45.458604 9.174864 45.458668 9.174907
12 1000 45.458668 9.174864 45.458732 9.174907
13 1000 45.458732 9.174864 45.458796 9.174907
14 1000 45.4 58796 9.174864 45.458860 9.174907
15 1000 45.458860 9.174864 45.458923 9.174907
16 1000 45.458923 9.174864 45.458987 9.174907
17 1000 45.458987 9.174864 45.459051 9.174907
18 1000 45.459051 9.174864 45.459115 9.174907
19 1000 45.459115 9.174864 45.459179 9.174907
20 1000 45.459179 9.174864 45.459243 9.174907
21 1000 45.459243 9.174864 45.459306 9.174907
22 1000 45.459306 9.174864 45.459370 9.174907
23 1000 45.459370 9.174864 45.459434 9.174907
24 1000 45.459434 9.174864 45.459498 9.174907
25 1000 45.459498 9.174864 45.459562 9.174907
26 1000 45.459562 9.174864 45.459626 9.174907
27 1000 45.459626 9.174864 45.459689 9.174907
28 1000 45.459689 9.174864 45.459753 9.174907
29 1000 45.459753 9.174864 45.459817 9.174907
... ... ... ... ... ...
970 1000 45.460583 9.175545 45.460647 9.175587
971 1000 45.460647 9.175545 45.460711 9.175587
972 1000 45.460711 9.175545 45.460775 9.175587
973 1000 45.460775 9.175545 45.460838 9.175587
974 1000 45.460838 9.175545 45.460902 9.175587
975 1000 45.460902 9.175545 45.460966 9.175587
976 1000 45.460966 9.175545 45.461030 9.175587
977 1000 45.461030 9.175545 45.461094 9.175587
978 1000 45.461094 9.175545 45.461157 9.175587
979 1000 45.461157 9.175545 45.461221 9.175587
980 1000 45.461221 9.175545 45.461285 9.175587
981 1000 45.461285 9.175545 45.461349 9.175587
982 1000 45.461349 9.175545 45.461413 9.175587
983 1000 45.461413 9.175545 45.461477 9.175587
984 1000 45.461477 9.175545 45.461540 9.175587
985 1000 45.461540 9.175545 45.461604 9.175587
986 1000 45.461604 9.175545 45.461668 9.175587
987 1000 45.457966 9.175587 45.458030 9.175630
988 1000 45.458030 9.175587 45.458094 9.175630
989 1000 45.458094 9.175587 45.458157 9.175630
990 1000 45.458157 9.175587 45.458221 9.175630
991 1000 45.458221 9.175587 45.458285 9.175630
992 1000 45.458285 9.175587 45.458349 9.175630
993 1000 45.458349 9.175587 45.458413 9.175630
994 1000 45.458413 9.175587 45.458477 9.175630
995 1000 45.458477 9.175587 45.458540 9.175630
996 1000 45.458540 9.175587 45.458604 9.175630
997 1000 45.458604 9.175587 45.458668 9.175630
998 1000 45.458668 9.175587 45.458732 9.175630
999 1000 45.458732 9.175587 45.458796 9.175630
它有22,000,000行×5列,数据框B如下所示:
类型Lat Lng
0 0 45.465739 9.180830
1 2 45.463950 9.187113
2 1 45.468015 9.180648
3 1 45.462209 9.187447
4 0 45.459578 9.184007
5 1 45.459822 9.187034
6 2 45.454988 9.180310
7 2 45.459818 9.189377
8 0 45.462200 9.187440
9 0 45.467160 9.180100
10 2 45.459407 9.183300
11 2 45.457699 9.187434
12 1 45.455319 9.186697
13 0 45.461138 9.191943
14 2 45.456397 9.189028
15 0 45.457062 9.185878
16 1 45.461980 9.187024
17 1 45.464319 9.183142
18 2 45.464227 9.187065
19 1 45.460886 9.185216
它有2,000,000行×3列。
我想用B替换数据框A的类型的值,其中:
A [latw] < B [lat]< A [late]和A [lngs]< B [lng]< B [lngn]
我想检查一个B中的位置属于A中的哪个矩形。
PS我正在寻找最快的方式在python例如使用并行处理。
高级说明
这个答案是很相关的。以下是高级别点
- 使用
但是,为了交换
类型
值,我在进程中添加了一个标志
函数重新分配。只需运行进程(A,B,reassign_type = True)
>
功能
我定义了返回相同内容的功能,并使其成为苹果比较def pir(A,B):
pre>
return A.values [process(A,B) ,0],0]
def maxu(A,B):
return B.apply(
lambda x:A.loc [
(A.latw < x.Lat)&
(x.Lat< A.late)&
(A.lngs< x.Lng)&
(x.Lng& A.lngn),'type'
] .head(1),
axis = 1
).values.diagonal()
更多彻底测试
from timeit import timeit
from itertools import product
rng = np.arange(20,51,5)
idx = pd.MultiIndex.from_arrays([rng,rng],names = ['B','A '])
functions = pd.Index(['pir','maxu'],name ='method')
results = pd.DataFrame(index = idx,columns = functions )
for bi,ai in idx:
n = ai
ws = np.random.rand(n,2)* 10 + np.array([[40,30 ]])
A = pd.DataFrame(
np.hstack([ - np.ones((n,1)),ws,ws + np.random.rand(n,2)* 2 ]),
columns = ['type','latw','lngs','late','lngn']
)
m = int(bi ** .5)
prng = np.arange(m)/ m * 10 + .5
B = pd.DataFrame(
np.array(list(product(prng,prng)))+ np.array [[40,30]]),
columns = ['Lat',Lng])
B.insert(0,'type',np.random.randint(3,size = len (B)))
在函数中的j:
results.set_value(
(bi,ai),j,
timeit(
' {}(A,B)'。格式(j),
'from __main__ import {},A,B'.format(j),
number = 10
)
)
results.plot(rot = 45)
代码
from collections import namedtuple
TwoSided = namedtuple('TwoSided',['match','idx','found'])
def two_sided_search (左,右,a):
#sort`a`并保存订单以获取
argsort = a.argsort()
asorted = a [argsort]
#在`asorted`中`````````````````````````````````````````````````````` ,side ='right')
rng = np.arange(left.size)
#a match h当我们发现`left`的时候,会发现`````````````以后发现
match_idx = rng [s_left< s_right]
#忽略没有匹配的位置
left_pos = s_left [match_idx]
right_pos = s_right [match_idx]
#从哪里找到的距离到哪里被发现
#这表示一个被左右包裹的项目
diff_pos = right_pos - left_pos
#在`match_idx`上建立一个索引,与
中的所有职位配对#`asorted`。
d2 = left_pos - np.append(0,diff_pos [: - 1] .cumsum())
p1 = np.arange(len(left_pos))。repeat(diff_pos)
p2 = np.arange(diff_pos.sum())+ d2.repeat(diff_pos)
#返回
#`match_idx`,它是左或右的整数片
#对于match_idx中的每个元素,
#a中有一个或多个元素被left和right
#p1包围与`argsort [p2]`
#配对的`match_idx`例如考虑`p1 = [1,1,2,2]`和`argsort [p2] = [3,5,12,5 ]`
#这意味着`left [match_idx [1]]< a [3] right [match_idx [1]]`
#和`left [match_idx [1]]< a [5] right [match_idx [1]]`
#和`left [match_idx [2]]< a [12] right [match_idx [2]]`
#和`left [match_idx [2]]< a [5]
返回TwoSided(match_idx,p1,argsort [p2])
def intersectNd(a,b):
#从http://stackoverflow.com/a/8317403/2336654
#返回2个二维数组的整数
#的策略是转换为结构化数组,然后
#使用np .intersect1d然后转换回非结构化
ncols = a.shape [1]
dtype = dict(
names = ['f {}'。format(i)for i in range(ncols )],
formats = ncols * [a.dtype]
)
返回np.intersect1d(
a.view(dtype),b.view(dtype )
).view(a.dtype).reshape(-1,ncols)
def where_in(b1,b2):
#返回一个lenght len(b2 )其中True
#当b2的元素在b1
bsort = b1.argsort()
bsrtd = b1 [bsort]
bsrch = bsrtd.searchsorted(b2)%bsrtd .size
return bsrtd [bsrch] == b2
def process(A,B,reassign_type = False):
lats = two_sided_se arch(
A.latw.values,
A.late.values,
B.Lat.values,
)
#子集A& B
#无点寻找经度匹配
#在没有纬度匹配的行
lngs = A.lngs.values [lats.match]
lngn = A.lngn.values [lats.match]
u,i = np.unique(lats.found,return_index = 1)
lng = B.Lng.values [u]
lons = two_sided_search
lngs,lngn,lng
)
#`lats.match`是我们找到成功纬度的地方
#`lons.match`是我们找到成功经度的地方
#由于`lons.match`基于`lats.match`,我们可以切
#来获得满足
#的原始位置'$ b#the latw< blat迟到lngs< blng< lngn
match = lats.match [lons.match]
#然而,对于A中可能与B匹配的每一行,
#可能有多个B的
#
#我们通过查找
#来开始解决这个问题,在这里我们已经匹配了
#这里给出了一个布尔数列的lats.match
#index对于满足条件的B
#中每行重复的值
wi = where_in(lons.match,lats.idx)
#a(?x 2)与
lon_pos = np.hstack([
lats.match [lons.match [lons.idx],None])匹配的A中
#行的组合,
u [lons
])
lat_pos = np.hstack([
lats.match [lats.idx [wi],None],
lats。发现[wi,无]
])
x = intersectNd(lat_pos,lon_pos)
p,pi = np.unique(x [:, 0],return_index = 1)
如果reassign_type:
A.iloc [x [p i,0],A.columns.get_loc('type')] = \
B.iloc [x [pi,1],B.columns.get_loc('type')]值
else:
return x pi
I have data frame "A" that looks like this:
type latw lngs late lngn 0 1000 45.457966 9.174864 45.458030 9.174907 1 1000 45.457966 9.174864 45.458030 9.174907 2 1000 45.458030 9.174864 45.458094 9.174907 3 1000 45.458094 9.174864 45.458157 9.174907 4 1000 45.458157 9.174864 45.458221 9.174907 5 1000 45.458221 9.174864 45.458285 9.174907 6 1000 45.458285 9.174864 45.458349 9.174907 7 1000 45.458349 9.174864 45.458413 9.174907 8 1000 45.458413 9.174864 45.458477 9.174907 9 1000 45.458477 9.174864 45.458540 9.174907 10 1000 45.458540 9.174864 45.458604 9.174907 11 1000 45.458604 9.174864 45.458668 9.174907 12 1000 45.458668 9.174864 45.458732 9.174907 13 1000 45.458732 9.174864 45.458796 9.174907 14 1000 45.458796 9.174864 45.458860 9.174907 15 1000 45.458860 9.174864 45.458923 9.174907 16 1000 45.458923 9.174864 45.458987 9.174907 17 1000 45.458987 9.174864 45.459051 9.174907 18 1000 45.459051 9.174864 45.459115 9.174907 19 1000 45.459115 9.174864 45.459179 9.174907 20 1000 45.459179 9.174864 45.459243 9.174907 21 1000 45.459243 9.174864 45.459306 9.174907 22 1000 45.459306 9.174864 45.459370 9.174907 23 1000 45.459370 9.174864 45.459434 9.174907 24 1000 45.459434 9.174864 45.459498 9.174907 25 1000 45.459498 9.174864 45.459562 9.174907 26 1000 45.459562 9.174864 45.459626 9.174907 27 1000 45.459626 9.174864 45.459689 9.174907 28 1000 45.459689 9.174864 45.459753 9.174907 29 1000 45.459753 9.174864 45.459817 9.174907 ... ... ... ... ... ... 970 1000 45.460583 9.175545 45.460647 9.175587 971 1000 45.460647 9.175545 45.460711 9.175587 972 1000 45.460711 9.175545 45.460775 9.175587 973 1000 45.460775 9.175545 45.460838 9.175587 974 1000 45.460838 9.175545 45.460902 9.175587 975 1000 45.460902 9.175545 45.460966 9.175587 976 1000 45.460966 9.175545 45.461030 9.175587 977 1000 45.461030 9.175545 45.461094 9.175587 978 1000 45.461094 9.175545 45.461157 9.175587 979 1000 45.461157 9.175545 45.461221 9.175587 980 1000 45.461221 9.175545 45.461285 9.175587 981 1000 45.461285 9.175545 45.461349 9.175587 982 1000 45.461349 9.175545 45.461413 9.175587 983 1000 45.461413 9.175545 45.461477 9.175587 984 1000 45.461477 9.175545 45.461540 9.175587 985 1000 45.461540 9.175545 45.461604 9.175587 986 1000 45.461604 9.175545 45.461668 9.175587 987 1000 45.457966 9.175587 45.458030 9.175630 988 1000 45.458030 9.175587 45.458094 9.175630 989 1000 45.458094 9.175587 45.458157 9.175630 990 1000 45.458157 9.175587 45.458221 9.175630 991 1000 45.458221 9.175587 45.458285 9.175630 992 1000 45.458285 9.175587 45.458349 9.175630 993 1000 45.458349 9.175587 45.458413 9.175630 994 1000 45.458413 9.175587 45.458477 9.175630 995 1000 45.458477 9.175587 45.458540 9.175630 996 1000 45.458540 9.175587 45.458604 9.175630 997 1000 45.458604 9.175587 45.458668 9.175630 998 1000 45.458668 9.175587 45.458732 9.175630 999 1000 45.458732 9.175587 45.458796 9.175630
It has 22,000,000 rows × 5 columns and there is data frame "B" which looks like this:
type Lat Lng 0 0 45.465739 9.180830 1 2 45.463950 9.187113 2 1 45.468015 9.180648 3 1 45.462209 9.187447 4 0 45.459578 9.184007 5 1 45.459822 9.187034 6 2 45.454988 9.180310 7 2 45.459818 9.189377 8 0 45.462200 9.187440 9 0 45.467160 9.180100 10 2 45.459407 9.183300 11 2 45.457699 9.187434 12 1 45.455319 9.186697 13 0 45.461138 9.191943 14 2 45.456397 9.189028 15 0 45.457062 9.185878 16 1 45.461980 9.187024 17 1 45.464319 9.183142 18 2 45.464227 9.187065 19 1 45.460886 9.185216
It has 2,000,000 rows × 3 columns. I want to replace type's value of data frame "A" with "B" Where:
A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn]
I want to check a location from B belongs to which one of the rectangles in A.
PS I'm looking for the fastest way in python such as using parallel processing.
解决方案High Level Explanation
This answer is quite involved. The following are high-level points
- Use
np.searchsorted
with the west latitudes inA
relative to the sorted latitudes inB
. Repeat this for east latitudes but with a parameter ofside='right'
. This tells me where each west/east latitudes fall in the spectrum defined inB
.np.searchsorted
are index values. So if the east based results are greater than the west based results, that implies that there was a latitude fromB
between west and east.- Track where east searches are greater than west searches.
- Find the difference. Not only do I care if the east search is greater than the west search, the difference tells me how many
B.Lat
's are in between. I expand arrays to track all possible matches.
- Repeat this process for south and north longitudes but only over the subset in which we found matching latitudes. No need to waste effort looking where we know we don't have a match.
- Perform tricky slicing to unwind positions
- Use
numpy
structured arrays to find intersections of 2-D arrays- I used this answer from @JoeKington please up vote his answer because this is awesome!
- This final intersection has ordinal positions from
A
and matching ordinal positions fromB
.- remember I said we could possible match more than one row from
B
, I take the first.
- remember I said we could possible match more than one row from
Demonstration
use
A
andB
provided by @MaxUp = process(A, B) print(p) [[3 0] [5 1] [9 2]]
To show the results and compare. I'll put them side by side
pd.concat([ A.iloc[p[:, 0]].reset_index(drop=True), B.iloc[p[:, 1]].reset_index(drop=True) ], axis=1, keys=['A', 'B'])
However, to swap the
type
values, I put a flag in theprocess
function to reassign. Just runprocess(A, B, reassign_type=True)
timing
functions
I defined the functions to return the same thing and make it an apples to apples comparisondef pir(A, B): return A.values[process(A, B)[:, 0], 0] def maxu(A, B): return B.apply( lambda x: A.loc[ (A.latw < x.Lat) & (x.Lat < A.late) & (A.lngs < x.Lng) & (x.Lng < A.lngn), 'type' ].head(1), axis=1 ).values.diagonal()
more thorough testing
from timeit import timeit from itertools import product rng = np.arange(20, 51, 5) idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A']) functions = pd.Index(['pir', 'maxu'], name='method') results = pd.DataFrame(index=idx, columns=functions) for bi, ai in idx: n = ai ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]]) A = pd.DataFrame( np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]), columns=['type', 'latw', 'lngs', 'late', 'lngn'] ) m = int(bi ** .5) prng = np.arange(m) / m * 10 + .5 B = pd.DataFrame( np.array(list(product(prng, prng))) + np.array([[40, 30]]), columns=['Lat', "Lng"]) B.insert(0, 'type', np.random.randint(3, size=len(B))) for j in functions: results.set_value( (bi, ai), j, timeit( '{}(A, B)'.format(j), 'from __main__ import {}, A, B'.format(j), number=10 ) ) results.plot(rot=45)
Code
from collections import namedtuple TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found']) def two_sided_search(left, right, a): # sort `a` and save order for later argsort = a.argsort() asorted = a[argsort] # where does `left` fall in `asorted` s_left = asorted.searchsorted(left) # where does `right` fall in `asorted` s_right = asorted.searchsorted(right, side='right') rng = np.arange(left.size) # a match happens when where we found `left`, `right` was found later match_idx = rng[s_left < s_right] # ignore positions with no match left_pos = s_left[match_idx] right_pos = s_right[match_idx] # distance from where left was found to where right was found # this represents how many items in a are wrapped by left and right diff_pos = right_pos - left_pos # build an index on `match_idx` paired with all positions in # `asorted`. d2 = left_pos - np.append(0, diff_pos[:-1].cumsum()) p1 = np.arange(len(left_pos)).repeat(diff_pos) p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos) # returns # `match_idx` which is an integer slice of either `left` or `right` # for each element in `match_idx` there are one or more elements in # `a` that were surrounded by `left` and `right` # `p1` are the positions in `match_idx` that are paired with `argsort[p2]` # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]` # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]` # and `left[match_idx[1]] < a[5] < right[match_idx[1]]` # and `left[match_idx[2]] < a[12] < right[match_idx[2]]` # and `left[match_idx[2]] < a[5] < right[match_idx[2]]` return TwoSided(match_idx, p1, argsort[p2]) def intersectNd(a, b): # taken from http://stackoverflow.com/a/8317403/2336654 # returns the interesection of 2 2-D arrays # the strategy is to convert to a structured array then # use np.intersect1d then convert back to unstructured ncols = a.shape[1] dtype = dict( names=['f{}'.format(i) for i in range(ncols)], formats=ncols * [a.dtype] ) return np.intersect1d( a.view(dtype), b.view(dtype) ).view(a.dtype).reshape(-1, ncols) def where_in(b1, b2): # return a mask of lenght len(b2) where True # when element of b2 is in b1 bsort = b1.argsort() bsrtd = b1[bsort] bsrch = bsrtd.searchsorted(b2) % bsrtd.size return bsrtd[bsrch] == b2 def process(A, B, reassign_type=False): lats = two_sided_search( A.latw.values, A.late.values, B.Lat.values, ) # subset A & B # no point looking for longitude matches # in rows without a latitude match lngs = A.lngs.values[lats.match] lngn = A.lngn.values[lats.match] u, i = np.unique(lats.found, return_index=1) lng = B.Lng.values[u] lons = two_sided_search( lngs, lngn, lng ) # `lats.match` is where we found successful latitudes # `lons.match` is where we found successful longitudes # since `lons.match` was based on `lats.match` we can slice # to get the original positions of `A` that satisfy both # the latw < blat < late & lngs < blng < lngn match = lats.match[lons.match] # however, for every row in A that might be a match for B, # there may be multilple B's # # We start addressing this by finding # where in lats.idx do we have matches # this gives me a boolean array of lats.match # index values repeated for each row in B # that satisfies our conditions wi = where_in(lons.match, lats.idx) # a (? x 2) array representing all combinations of # rows in A that matched the lon_pos = np.hstack([ lats.match[lons.match[lons.idx], None], u[lons.found, None] ]) lat_pos = np.hstack([ lats.match[lats.idx[wi], None], lats.found[wi, None] ]) x = intersectNd(lat_pos, lon_pos) p, pi = np.unique(x[:, 0], return_index=1) if reassign_type: A.iloc[x[pi, 0], A.columns.get_loc('type')] = \ B.iloc[x[pi, 1], B.columns.get_loc('type')].values else: return x[pi]
这篇关于快速(矢量化)的方式来找到属于同样大小的矩形(由两点给出)的一个DF中的点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- Use