Which GEO implementation to use for millions of points


Question

I am trying to figure out which GEO implementation to use to find the points nearest to a given point, based on longitude/latitude. I will have millions, if not billions, of different latitude/longitude points to compare. I have looked at many different implementations: PostGIS (very popular and apparently performs well), Neo4j (graph databases are a new concept to me and I am unsure how they perform), AWS DynamoDB geohash (scales very well, but the only library is written in Java and I am hoping to write a library in node.js), and others, but I can't figure out which would perform best. I am looking purely at performance, as opposed to number of features. All I need is to compare one point against all points and find the closest (read operation), and to be able to change a point in the database quickly (write operation). Could anyone suggest a good implementation based on these requirements?

Recommended answer

PostGIS has several functions for geohashing. If you make your geohash strings long enough, the search becomes quicker (fewer collisions per box plus its 8 neighbours), but geohash generation becomes slower when inserting new points.

The question is also how accurate you want to be. At increasing latitude, lat/long "distance" deteriorates because a degree of longitude shrinks from about 110 km at the Equator to 0 at the poles, while a degree of latitude is always about 110 km. At the mid-latitude of 45 degrees a degree of longitude is nearly 79 km, giving an error of a factor of about 2 in squared distance (sqr(110/79) is roughly 1.94). Spherical distance, which gives you the true distance between lat/long pairs, is very expensive to calculate (lots of trigonometry), and then your geohashing won't work (unless you convert all points to planar coordinates).
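To get a feel for the size of this effect, the following sketch compares the naive distance in degrees with the great-circle distance for two one-degree steps starting at 60 degrees north. It assumes PostGIS 2.2 or later (for ST_DistanceSphere); the coordinates are arbitrary illustrative values.

```sql
-- Two one-degree steps from the same point at 60N: one east, one north.
-- Both are exactly 1.0 in raw degrees, but their true lengths differ.
SELECT step,
       ST_Distance(a, b)       AS planar_degrees,  -- naive lat/long "distance"
       ST_DistanceSphere(a, b) AS sphere_metres    -- great-circle distance on a sphere
FROM (VALUES
        ('east',  ST_SetSRID(ST_MakePoint(10, 60), 4326), ST_SetSRID(ST_MakePoint(11, 60), 4326)),
        ('north', ST_SetSRID(ST_MakePoint(10, 60), 4326), ST_SetSRID(ST_MakePoint(10, 61), 4326))
     ) AS v(step, a, b);
```

The east step should come out at roughly half the length of the north step, since cos(60 degrees) = 0.5.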

A solution that might work is the following:

  • CREATE INDEX hash8 ON tablename(substring(hash_column FROM 1 FOR 8)). This gives you an index on a box one level coarser than your stored resolution (each 8-char box contains 32 of the 9-char boxes), which helps in finding points and reduces the need to search neighbouring hash boxes.
  • On INSERT of a point, compute its geohash of length 9 (approx. 10 m resolution) into hash_column, using PostGIS. You could use a BEFORE INSERT TRIGGER here.
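The two steps above could be set up roughly as follows. This is a minimal sketch, not a tuned schema: the table name all_points, its columns id, pt and hash_column, and the trigger and function names are assumptions chosen to match the function further down.

```sql
-- Hypothetical table; all names here are assumptions for illustration.
CREATE TABLE all_points (
  id          serial PRIMARY KEY,
  pt          geometry(Point, 4326) NOT NULL,  -- longitude/latitude in WGS84
  hash_column text
);

-- Fill hash_column with the 9-character geohash whenever a point is written.
CREATE FUNCTION set_geohash() RETURNS trigger AS $$
BEGIN
  NEW.hash_column := ST_GeoHash(NEW.pt, 9);
  RETURN NEW;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER all_points_set_geohash
  BEFORE INSERT OR UPDATE OF pt ON all_points
  FOR EACH ROW EXECUTE FUNCTION set_geohash();  -- EXECUTE PROCEDURE before PostgreSQL 11

-- Expression index on the 8-character prefix used for lookups.
CREATE INDEX hash8 ON all_points (substring(hash_column FROM 1 FOR 8));
```

Firing the trigger on UPDATE OF pt as well keeps the hash in sync when a point is moved, which is the write operation the question asks about.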

In a function:

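  • Given a point, find the nearest point by looking up all points whose geohash, shortened to 8 characters, equals the 8-char geohash of the given point (hence the index above).
  • Compute the distance to each point encountered using spherical coordinates, keeping the nearest one. But since you are only looking for the nearest point (at least initially), do not search on distance in spherical coordinates; use the optimization below instead, which makes the search much faster.
  • Compute whether the given point is closer to an edge of the box determined by the 8-char geohash than to the nearest point found. If it is, repeat the procedure using the 7-char geohash on all points in its 8 neighbours. This can be heavily optimized by computing the distances to the individual box sides and corners and evaluating only the relevant neighbouring hash boxes; I leave that to you to tinker with.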
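The edge check in the last step could start from something like the sketch below, which measures how far a search point is from the boundary of its own 8-char geohash box. The search point is a hypothetical value, and the result is in raw degrees, so the longitude correction discussed further down would still need to be applied before comparing it with a candidate distance.

```sql
-- Hypothetical search point (longitude, latitude); replace with your own.
WITH q AS (
  SELECT ST_SetSRID(ST_MakePoint(4.8952, 52.3702), 4326) AS pt
),
cell AS (
  -- Polygon of the 8-character geohash box containing the search point.
  SELECT q.pt,
         ST_SetSRID(ST_GeomFromGeoHash(ST_GeoHash(q.pt, 8)), 4326) AS box
  FROM q
)
SELECT ST_Distance(pt, ST_Boundary(box)) AS edge_distance_deg  -- raw degrees
FROM cell;
```

If the nearest candidate found inside the box is farther away than this edge distance, a closer point may exist in a neighbouring box and the search has to be widened.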

In any case, this will not be particularly speedy. If you are indeed heading towards billions of points, you might want to think about clustering, which has a rather "natural" solution with geohashing: break up your table on a short prefix such as substring(hash_column FROM 1 FOR 2), for instance. Each geohash character encodes 5 bits, so a 2-character prefix yields up to 1,024 coarse cells; a split into four quadrants corresponds to the first 2 bits only. Just make sure that you account for cross-boundary searches.
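Before splitting the table, it may help to see how the points are distributed over the candidate prefixes. A minimal sketch, assuming a table all_points with a geohash column hash_column (both names are assumptions):

```sql
-- Number of points per 2-character geohash prefix, largest cells first.
SELECT substring(hash_column FROM 1 FOR 2) AS cell,
       count(*)                            AS n_points
FROM all_points
GROUP BY cell
ORDER BY n_points DESC;
```

Heavily skewed counts would suggest using a longer prefix for the dense cells rather than one uniform prefix length.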

Two optimizations can be made fairly quickly:

First, "normalize" your spherical coordinates (meaning: compensate for the reduced length of a degree of longitude with increasing latitude) so that you can search for nearest points using a "pseudo-cartesian" approach. This only works if points are close together, but since you are working with lots of points this should not be a problem. More specifically, this should work for all points in geohash boxes of length 6 or more.

Assuming the WGS84 ellipsoid (which is used in all GPS devices), the Earth's semi-major axis (a) is 6,378,137 meters, with a squared eccentricity (e2) of 0.00669438. A second of longitude has a length (in meters) of

longSec := Pi * a * cos(lat) / sqrt(1 - e2 * sqr(sin(lat))) / 180 / 3600

longSec := 30.92208078 * cos(lat) / sqrt(1 - 0.00669438 * sqr(sin(lat)))

and a second of latitude:

latSec := 30.870265 - 0.155506 * cos(2 * lat) + 0.0003264 * cos(4 * lat)

To make your local coordinate system "square", multiply your longitude differences by the correction factor longSec/latSec.
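As a worked example, the query below evaluates both lengths and their ratio at three latitudes. It is a sketch that uses the standard series coefficients for the length of a second of latitude (0.155506 and 0.0003264 as multipliers of the cosine terms); the sample latitudes are arbitrary.

```sql
SELECT lat,
       long_sec,                          -- meters per second of longitude
       lat_sec,                           -- meters per second of latitude
       long_sec / lat_sec AS corr_factor  -- multiply longitude differences by this
FROM (
  SELECT lat,
         30.92208078 * cos(radians(lat))
           / sqrt(1 - 0.00669438 * power(sin(radians(lat)), 2)) AS long_sec,
         30.870265 - 0.155506 * cos(2 * radians(lat))
                   + 0.0003264 * cos(4 * radians(lat))          AS lat_sec
  FROM (VALUES (0::double precision), (45), (60)) AS v(lat)
) AS s;
```

The factor should be close to 1 at the Equator, roughly 0.71 at 45 degrees and roughly 0.5 at 60 degrees.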

Secondly, since you are looking for the nearest point, do not search on distance, because the square root is computationally expensive. Instead, search on the term within the square root (the squared distance, if you will), because it selects the same nearest point.
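Expressed as a single set-based query, ordering by squared distance could look like the sketch below. The table all_points with columns id, pt and hash_column, the search point and the correction factor of 0.612 (approximately longSec/latSec near 52 degrees north) are all assumptions for illustration.

```sql
WITH q AS (
  SELECT ST_SetSRID(ST_MakePoint(4.8952, 52.3702), 4326) AS pt,
         0.612 AS corr_factor  -- assumed value for this latitude
)
SELECT p.id
FROM all_points AS p
CROSS JOIN q
WHERE substring(p.hash_column FROM 1 FOR 8) = ST_GeoHash(q.pt, 8)
ORDER BY power(ST_Y(p.pt) - ST_Y(q.pt), 2)
       + power((ST_X(p.pt) - ST_X(q.pt)) * q.corr_factor, 2)  -- squared distance, no sqrt
LIMIT 1;
```

Like the approach described here, this only looks inside the search point's own 8-char box and does not check neighbouring boxes.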

In pseudo-code:

CREATE FUNCTION nearest_point(pt geometry, ptHash8 char(8)) RETURNS integer AS $$
DECLARE
  corrFactor double precision;
  ptLat      double precision;
  ptLong     double precision;
  currPt     record;
  minDist    double precision;
  distSq     double precision;
  diffLat    double precision;
  diffLong   double precision;
  minId      integer;
BEGIN
  minDist := 'Infinity'::double precision; -- squared distance in degrees; nothing found yet
  ptLat := ST_Y(pt);
  ptLong := ST_X(pt);
  -- longSec / latSec at the latitude of the search point
  corrFactor := 30.92208078 * cos(radians(ptLat)) / (sqrt(1 - 0.00669438 * power(sin(radians(ptLat)), 2)) *
                (30.870265 - 0.155506 * cos(2 * radians(ptLat)) + 0.0003264 * cos(4 * radians(ptLat))));
  -- all_points is assumed to have columns id, pt and hash_column;
  -- the predicate matches the expression of the hash8 index
  FOR currPt IN SELECT p.id, p.pt FROM all_points p
                WHERE substring(p.hash_column FROM 1 FOR 8) = ptHash8
  LOOP
    diffLat := ST_Y(currPt.pt) - ptLat;
    diffLong := (ST_X(currPt.pt) - ptLong) * corrFactor; -- "square" things out
    distSq := diffLat * diffLat + diffLong * diffLong;   -- squared distance, no square root
    IF distSq < minDist THEN
      minDist := distSq; -- this does not happen so often
      minId := currPt.id;
    END IF;
  END LOOP;
  RETURN minId; -- NULL if the hash box holds no points
END; $$ LANGUAGE PLPGSQL STRICT;
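A call to this function might look like the sketch below, which derives the 8-char hash from the search point itself. The coordinates are hypothetical, and the function is assumed to have been created as shown above.

```sql
SELECT nearest_point(q.pt, ST_GeoHash(q.pt, 8)::char(8)) AS nearest_id
FROM (SELECT ST_SetSRID(ST_MakePoint(4.8952, 52.3702), 4326) AS pt) AS q;
```

If the search point is itself stored in the table, it will be returned as its own nearest neighbour, so the caller may want to exclude it.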

Needless to say, this would be a lot faster as a C language function. Also, do not forget to do boundary checking to see whether neighbouring geohash boxes need to be searched.

Incidentally, "spatial purists" would not index on the 8-char geohash and search from there; instead they would start from the 9-char hash and work outwards. However, a "miss" in your initial hash box (because there are no other points, or because you are close to a side of the box) is expensive, since you have to start computing distances to neighbouring hash boxes and pull in more data. In practice you should work from a hash box which is about twice the size of the typical nearest-point distance; what that distance is depends on your point set.
