在SQL Server 2008上对7000万个极高密度空间点云进行优化最近邻查询 [英] optimize nearest neighbor query on 70 million extremely high density spatial point cloud on SQL Server 2008

查看:116
本文介绍了在SQL Server 2008上对7000万个极高密度空间点云进行优化最近邻查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在SQL Server 2008 R2 Express数据库中有大约7500万条记录。每个是长度对应于某个值的lat。该表有地理专栏。我试图找到一个给定纬度经度(点)的最近邻居。我已经有一个空间索引的查询。但是,根据记录在数据库中的位置,例如第一季度或最后一个季度,查询可能需要大约3到30秒才能找到最近的邻居。我觉得这可以通过优化查询或空间索引进行优化,以获得更快的结果。
现在使用默认设置应用了一些空间索引。这是我的表和查询的样子。

I have about 75 million records in a SQL Server 2008 R2 Express database. Each is a lat long corresponding to some value. The table has geography column. I am trying to find one nearest neighbor for a given latitude longitude (point). I already have a query with spatial index in place. But depending on where the record is in the database, say first quarter or last quarter, the query can take about from 3 to 30 seconds to find the nearest neighbor. I feel this can be optimized to give lot faster result by optimizing the query or spatial index. Right now applied some the spatial index with default settings. Here is what my table and query looks like.

CREATE TABLE lidar(
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [POINTID] [int] NOT NULL,
    [GRID_CODE] [numeric](17, 8) NULL,
    [geom] [geography] NULL,
 CONSTRAINT [PK_lidar_1] PRIMARY KEY CLUSTERED ([id] ASC)
 WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
 ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

我正在使用的空间索引:

The spatial Index i am using:

CREATE SPATIAL INDEX [SPATIAL_lidar] ON [dbo].[lidar] ([geom]) USING  GEOGRAPHY_GRID 
WITH (
GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM), 
CELLS_PER_OBJECT = 16, PAD_INDEX  = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,  
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

这是我正在使用的查询:

Here is the Query I am using:

declare @ms_at geography = 'POINT (-95.66 30.04)';
select TOP(1) nearPoints.geom.STAsText()as latlon 
from
(
select r.geom
from lidar r With(Index(SPATIAL_lidar))
where r.geom.STIntersects(@ms_at.STBuffer(1000)) = 1
) nearPoints

以下是我数据库中lat long的示例。给出准确性和密度的概念。所有7000万条记录都是针对一个城市的(激光雷达数据)

Here is a sample of lat longs in my database . to give an idea of accuracy and density. All the 70 million records are for one city (Lidar Data)

POINT (-95.669434934023087 30.049513838913736)

现在这个查询给出了我上面描述的结果,但我希望尽可能地提高性能。我的猜测是通过调整空间索引的默认值,我可能能够提高性能。这有什么线索吗?

Now this query gives me results as i described above, but i want to improve the performance as much as possible. My guess is by tweaking the default values of the spatial index i might be able to improve the performance. Any clues this?

我尝试将缓冲区从10改为1000但结果几乎相同。

I tried varying the buffer from 10 to 1000 but with almost same results.

还有其他任何建议提高性能是受欢迎的。

Also any other suggestions to improve the performance are welcome.

这是我现在使用的系统:

Here is the system i am using right now:

Windows 7 64bit Professional
Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (4 CPUs), ~3.0GHz
Ram: 8 GB
NVIDIA GeForce 9500 GT


推荐答案

抱歉,这不是SQL的答案,而是一种获取方式假设您的数据存在某些限制,可预测的性能。

Sorry, but this is not a SQL answer, but a way of getting predictable performance assuming certain constraints on your data.

数据的变化频率如何?如果可能,您是否可以预先计算每个实体5最近邻居的图表,并使用它来加快您的选择。?

How often is the data changing? If possible, could you pre-calculate a graph of every entity 5 nearest neighbors, and use that to speed up your selection.?

如果此数据大部分是只读的,那么......

If this data is mostly read only, then...

这些点分布的均匀度如何?如果相当均匀并且知道分布,那么你可以通过在哈希表中分解每个坐标和索引来滚动自己的空间映射。

How evenly are these points distributed? If fairly evenly and well know distribution, then could you roll your own spacial mapping by bucketing every coordinate and index in a hash table.

如果你不需要数据库中的数据,将其移出到内存映射文件以进行快速哈希查找。 (70米记录应该很容易放入内存中。)

If you don't need to have the data in the database, move it out to a memory mapped file for fast hash lookups. (70m records should easily fit in memory).

我使用这种架构来生成亚毫秒查找,以显示广告和搜索引擎的相关性。

I have use this architecture to generate sub-millisecond lookups for for display advertising and search engine relevance.

==精化==

您只需创建一个固定大小的正方形网格(如棋盘) ),并将每个点映射到网格中,并创建一个属于每个网格框的对象列表 - 如果你正确调整每个框的大小,你应该平均有5-50个点每个方块 - 原则上这是一个四叉树,但为了简单起见没有树。

You simply create a grid of fixed size squares (like a chessboard), and you map each point into the grid, and you create a list of the objects witch belong into each of the grid-boxes -- if you adjust the size of each box correctly, you should on average have 5-50 points in each square -- This is in principle a quad-tree but without the tree for simplicity.

对于每个桶中的所有数据分散后的空桶,添加包含数据的最近存储桶的信息。

For each bucket which is empty after you have scattered all the data in buckets, you add information of which nearest buckets that contain data.

现在,您可以对每个存储桶进行从左到右的行编号,以便每个存储桶都具有唯一的可以从坐标计算的数字 - 并将每个桶插入哈希表,或者如果空间pe只是直接查找表。

You can now number each bucket left-to-right-line-ny-line so that each bucket have a unique number which can be calculated from the coordinates -- and you insert each bucket into a hash table, or if space permitting just as a straight lookup table.

现在当你有了查询时,你只需计算将映射到哪个桶,你就会得到一个对象列表。那个桶,或者你会得到一个'空'桶,里面包含指向最近有内容的桶的指针。

Now when you have your query, you simply calculate which bucket that will map to, and you will get either a list of objects in that bucket, or you will get a 'empty' bucket which contains the pointers to the nearest bucket that have content.

这将为你提供第一个候选对象列表你正在寻找,现在你只需要跑步,看看哪一个是最接近的。

That will give you you first candidate list of objects that you are looking for, and now you simply just have to run though and see whcih one is the closest.

在99%的情况下会是这样 - 但是如果你担心那里(a)要么是在下一个桶中的某些条件,实际上是更接近,然后检查周围的8个水桶,看看你是否能找到更近的水桶。

In 99% of cases that would be it -- but if you are worried about that there (a) either are some condidates in the next-over buckets which are actually closer, then just check the 8 surrounding buckets, and see if you can find any closer there.

如果你现在还想得到所有物体的清单最近,然后还为每个对象计算一个5个最近邻域的简单网络,这样你最终会得到一个数据结构,如A - > {B,C,D,E,F},B - > {A,D,G ,H,I},C - > {A,J,K,G,M} ....

If you now also want to get a list of all the objects which are closest, then also calculate a simple network of 5 nearest neigbours for each obejct, so you will end up with a data structure like A->{B,C,D,E,F}, B->{A,D,G,H,I}, C->{A,J,K,G,M}....

这将构成一个简单的网络,您现在可以遍历这里有 Dijkstra 的变体,以获得与您最近点相邻的所有点。

That will form a simple network which you can now traverse with a variation of Dijkstra here to get all the point adjacent to you nearest point.

构建数据结构需要时间,但一旦完成,完成正确查找并返回数据集可以在亚毫秒内完成(不包括任何http或offbox通信)

Building the data structures will take time, but once done, and done right lookup and returning a dataset can be done in sub milliseconds (not including any http or off-box communication ofcause)

希望这会有所帮助。

这篇关于在SQL Server 2008上对7000万个极高密度空间点云进行优化最近邻查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆