Which GEO implementation to use for millions of points

Question

I am trying to figure out which GEO implementation to use to find the points nearest to a given point, based on longitude/latitude. I will have millions, if not billions, of different latitude/longitude points that will need to be compared. I have looked at many different implementations: PostGIS (it looks very popular and performs well), Neo4j (graph databases are a new concept to me and I am unsure how they perform), AWS DynamoDB geohash (scales very well, but the only library is written in Java, and I am hoping to write a library in node.js), etc., but I can't figure out which would perform best. I am looking purely at performance as opposed to number of features. All I need is to be able to compare one point to all points and find the closest (read operation), and to be able to change a point in the database quickly (write operation). Could anyone suggest a good implementation based on these requirements?

Recommended answer

PostGIS has several functions for geohashing. If you make your hash strings long enough the search becomes quicker (fewer collisions per box plus its 8 neighbours), but geohash generation on inserting new points becomes slower.

The question is also how accurate you want to be. At increasing latitude, lat/long "distance" deteriorates because a degree of longitude shrinks from about 110km at the Equator to 0 at the poles, while a degree of latitude is always about 110km. At the mid-latitude of 45 degrees a degree of longitude is nearly 79km, which distorts east-west distances by a factor of about 1.4 and squared distances by a factor of about 2 (sqr(110/79)). Spherical distance, which gives you the true distance between lat/long pairs, is very expensive to calculate (lots of trigonometry going on), and then geohashing won't work (unless you convert all points to planar coordinates).
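To get a feel for the size of this distortion, here is a minimal Python sketch. It assumes a simple spherical model and an equatorial degree length of 111.32 km (the text rounds this to 110 km); the function name is made up for illustration.

```python
import math

KM_PER_DEGREE_AT_EQUATOR = 111.32  # assumed value; the text rounds it to 110 km

def km_per_degree_longitude(lat_deg: float) -> float:
    """Approximate east-west length of one degree of longitude (spherical model)."""
    return KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(lat_deg))

for lat in (0, 45, 60, 89):
    lon_km = km_per_degree_longitude(lat)
    # How much a raw longitude difference overstates east-west distance
    stretch = KM_PER_DEGREE_AT_EQUATOR / lon_km
    print(f"lat {lat:2d}: {lon_km:6.1f} km per degree, "
          f"stretch {stretch:.2f}, squared {stretch ** 2:.2f}")
```

The squared stretch is the relevant number if you compare squared coordinate differences, which is what a nearest-point search typically does.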

A solution that might work is the following:

  • CREATE INDEX hash8 ON tablename(substring(hash_column FROM 1 FOR 8)). This gives you an index on a box that is coarser than your stored resolution (each dropped geohash character makes the box 32 times larger in area), which helps finding points and reduces the need to search neighbouring hash boxes.
  • On INSERT of a point, compute its geohash of length 9 (10m resolution approx.) into hash_column, using PostGIS. You could use a BEFORE INSERT TRIGGER here.
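To illustrate what the 9-character hash and its 8-character prefix look like, here is a small, self-contained geohash encoder in Python. This is only a sketch of the standard geohash scheme for experimenting outside the database; in PostGIS you would call ST_GeoHash instead, and the function name geohash_encode is an assumption for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash_encode(lat: float, lon: float, length: int = 9) -> str:
    """Encode a lat/long pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

hash9 = geohash_encode(57.64911, 10.40744, 9)
hash8 = hash9[:8]  # the value the expression index is built on
print(hash9, hash8)
```

Because a shorter geohash is always a prefix of a longer one for the same point, indexing the first 8 characters of the stored 9-character hash groups all points of the coarser box together.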

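In a function:

  • Given a point, find the nearest point by looking up all points whose geohash, shortened to 8 characters, equals the 8-character geohash of the given point (hence the index above).
  • Compute the distance to every point you encounter and keep the nearest one. Spherical coordinates would give the true distance, but since you are only looking for the nearest point (at least initially), do not search on spherical distance; use the optimizations described further down instead, which make the search much faster.
  • Check whether the given point is closer to an edge of the box defined by the 8-character geohash than to the nearest point found so far. If so, repeat the procedure for all points in the 8 neighbouring boxes, using the 7-character geohash. This can be optimized heavily by computing the distance to the individual sides and corners of the box and evaluating only the relevant neighbouring hash boxes. I leave that for you to tinker with.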
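The boundary check in the last step can be sketched as follows. This is a hedged illustration in Python that assumes planar ("squared-out") coordinates and a box given by its corner coordinates; Box, distance_to_nearest_edge and must_search_neighbours are hypothetical names, not part of PostGIS.

```python
from typing import NamedTuple, Optional

class Box(NamedTuple):
    """Axis-aligned hash box in planar coordinates (hypothetical helper type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

def distance_to_nearest_edge(x: float, y: float, box: Box) -> float:
    """Distance from a point inside the box to the closest box side."""
    return min(x - box.min_x, box.max_x - x, y - box.min_y, box.max_y - y)

def must_search_neighbours(x: float, y: float, box: Box,
                           best_dist_sq: Optional[float]) -> bool:
    """True if a closer point could still exist outside the box.

    best_dist_sq is the squared distance to the nearest point found inside
    the box, or None if the box contained no other point.
    """
    if best_dist_sq is None:
        return True
    edge = distance_to_nearest_edge(x, y, box)
    # Compare squared values so no square root is needed
    return edge * edge < best_dist_sq

box = Box(0.0, 0.0, 10.0, 10.0)
print(must_search_neighbours(5.0, 5.0, box, 4.0))  # centre of box, neighbour 2 units away
print(must_search_neighbours(1.0, 5.0, box, 4.0))  # 1 unit from the left edge
```

A refinement along the lines suggested above would return which sides or corners are closer than the best distance, so that only those neighbouring boxes are fetched.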

In any case, this will not be particularly speedy. If you are indeed going towards billions of points you might want to think about clustering, which has a rather "natural" solution for geohashing (break up your table on substring(hash_column FROM 1 FOR 2), for instance, giving you up to 1,024 partitions, one per 2-character prefix). Just make sure that you account for cross-boundary searches.

Two optimizations can be made fairly quickly:

First, "normalize" your spherical coordinates (meaning: compensate for the reduced length of a degree of longitude with increasing latitude) so that you can search for nearest points using a "pseudo-cartesian" approach. This only works if points are close together, but since you are working with lots of points this should not be a problem. More specifically, this should work for all points in geohash boxes of length 6 or more.

Assuming the WGS84 ellipsoid (which is used in all GPS devices), the Earth's semi-major axis (a) is 6,378,137 meters, with a squared eccentricity (e2) of 0.00669438. A second of longitude has a length of

longSec := Pi * a * cos(lat) / sqrt(1 - e2 * sqr(sin(lat))) / 180 / 3600

longSec := 30.92208078 * cos(lat) / sqrt(1 - 0.00669438 * sqr(sin(lat)))

And a second of latitude has a length of

latSec := 30.870265 - 0.155506 * cos(2 * lat) + 0.0003264 * cos(4 * lat)

To make your local coordinate system "square", multiply your longitude values by the correction factor longSec/latSec.
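As a quick numerical check of these formulas, here is a Python sketch. It assumes the latitude series with the cosine terms multiplied by the coefficients 0.155506 and 0.0003264 (the usual meridian-arc series for WGS84, expressed in meters per arc-second); the function names are chosen for this example only.

```python
import math

A = 6378137.0    # WGS84 semi-major axis in meters
E2 = 0.00669438  # WGS84 squared eccentricity

def long_sec(lat_deg: float) -> float:
    """Length in meters of one arc-second of longitude at the given latitude."""
    lat = math.radians(lat_deg)
    return (math.pi * A * math.cos(lat)
            / math.sqrt(1 - E2 * math.sin(lat) ** 2) / 180 / 3600)

def lat_sec(lat_deg: float) -> float:
    """Length in meters of one arc-second of latitude (assumed series form)."""
    lat = math.radians(lat_deg)
    return 30.870265 - 0.155506 * math.cos(2 * lat) + 0.0003264 * math.cos(4 * lat)

def correction_factor(lat_deg: float) -> float:
    """Factor to multiply longitude differences by to 'square' the local grid."""
    return long_sec(lat_deg) / lat_sec(lat_deg)

for lat in (0, 30, 45, 60):
    print(f"lat {lat:2d}: longSec {long_sec(lat):8.4f} m, "
          f"latSec {lat_sec(lat):8.4f} m, factor {correction_factor(lat):.4f}")
```

The factor is close to 1 at the Equator and drops to roughly 0.7 at 45 degrees, which matches the shrinking degree of longitude described earlier.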

Secondly, since you are looking for the nearest point, do not search on distance because of the computationally expensive square root. Instead, search on the term within the square root, the squared distance if you will, because this has the same property of selecting for the nearest point.
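A tiny Python sketch shows why this is safe: the square root is monotonic, so ranking by squared distance picks the same point as ranking by distance. The sample coordinates and function names are made up for illustration.

```python
import math

def nearest_by_squared_distance(target, candidates):
    """Pick the nearest candidate by comparing squared distances only."""
    tx, ty = target
    return min(candidates, key=lambda p: (p[0] - tx) ** 2 + (p[1] - ty) ** 2)

def nearest_by_distance(target, candidates):
    """Reference version that pays for a square root per candidate."""
    tx, ty = target
    return min(candidates, key=lambda p: math.hypot(p[0] - tx, p[1] - ty))

points = [(3.0, 4.0), (1.0, 1.5), (-2.0, 0.5), (0.5, -0.25)]
target = (0.0, 0.0)
print(nearest_by_squared_distance(target, points))
print(nearest_by_distance(target, points))
```

Both calls select the same point; only the first avoids the square root.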

In pseudo-code:

CREATE FUNCTION nearest_point(pt geometry, ptHash8 char(8)) RETURNS integer AS $$
DECLARE
  corrFactor double precision;
  ptLat      double precision;
  ptLong     double precision;
  currPt     record;
  minDist    double precision;
  currDist   double precision;
  diffLat    double precision;
  diffLong   double precision;
  minId      integer;
BEGIN
  minDist := 'Infinity'; -- squared distance in degrees^2; nothing found yet
  ptLat := ST_Y(pt);
  ptLong := ST_X(pt);
  -- longSec / latSec at the latitude of the given point
  corrFactor := 30.92208078 * cos(radians(ptLat)) / (sqrt(1 - 0.00669438 * power(sin(radians(ptLat)), 2)) *
                (30.870265 - 0.155506 * cos(2 * radians(ptLat)) + 0.0003264 * cos(4 * radians(ptLat))));
  -- Filter on the indexed expression so that the hash8 index can be used;
  -- the table alias avoids a name clash with the pt parameter
  FOR currPt IN SELECT p.id, p.pt FROM all_points p
                WHERE substring(p.hash_column FROM 1 FOR 8) = ptHash8
  LOOP
    diffLat := ST_Y(currPt.pt) - ptLat;
    diffLong := (ST_X(currPt.pt) - ptLong) * corrFactor; -- "square" things out
    currDist := diffLat * diffLat + diffLong * diffLong; -- squared distance, no square root
    IF currDist < minDist THEN
      minDist := currDist; -- this does not happen so often
      minId := currPt.id;
    END IF;
  END LOOP;
  RETURN minId; -- NULL if the hash box holds no points
END; $$ LANGUAGE PLPGSQL STRICT;

Needless to say, this would be a lot faster in a C language function. Also, do not forget to do boundary checking to see if neighbouring geohash boxes need to be searched.

Incidentally, "spatial purists" would not index on the 8-char geohash and search from there; instead they would start from the 9-char hash and work outwards from there. However, a "miss" in your initial hash box (because there are no other points or you are close to a hash box side) is expensive because you have to start computing distances to neighbouring hash boxes and pull in more data. In practice you should work from a hash box which is about twice the size of the typical nearest-point distance; what that distance is depends on your point set.
