How can I find the closest document using Google App Engine Search API?

Problem description

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.

I would like to find the closest document to a specific set of coordinates but find the following code gives incorrect results:

from google.appengine.api import search

# coords is a (latitude, longitude) tuple, e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location',
                            value=search.GeoPoint(coords[0], coords[1]))])

# Find the closest document to coords; radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)

    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d' \
                    % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))

    index = search.Index(name='document-index')
    return index.search(search_query)

With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.

I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is reduced to around 12 km (radius=12000). There are generally no more than 1000 documents within a 12 km radius. (This is probably associated with search.SortOptions(limit=1000).)

The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?

Solution

I believe the issue is the following. Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.) So I suspect that some of the geographically closer points are not included in this 10k selection. That's why things work better when you narrow your search radius, as you have fewer total points in that radius.
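The suspected effect can be reproduced with a toy simulation in plain Python. This is only a sketch under the assumption stated above (the cap is applied before the distance sort); `nearest_with_cap`, `haversine_m` and the `cap` parameter are illustrative names, not part of the Search API, and list order stands in for whatever order the backend happens to select documents in.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius, good enough for a sketch


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) tuples."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def nearest_with_cap(points, origin, radius_m, cap):
    """Mimic 'filter by radius, keep at most `cap` hits, then sort'."""
    hits = [p for p in points if haversine_m(p, origin) < radius_m]
    selected = hits[:cap]  # assumed: the cap is applied BEFORE the sort
    if not selected:
        return None
    return min(selected, key=lambda p: haversine_m(p, origin))


london = (51.5, -0.12)
edinburgh = (55.95, -3.19)
nearby = (51.51, -0.12)
points = [edinburgh] * 5 + [nearby]

# Wide radius, small cap: the nearby point never reaches the sort.
print(nearest_with_cap(points, london, 1000000, cap=5))
# Narrow radius: only the nearby point matches, so the cap is harmless.
print(nearest_with_cap(points, london, 12000, cap=5))
```

With the wide radius the capped selection contains only the Edinburgh points, which matches the "nearest document to London is in Scotland" symptom; narrowing the radius removes them from the candidate set.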

Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on. You can address this in at least a couple of ways, which you can combine:

  • Add a ranking, so that the most 'important' docs (by some criteria that make sense in your domain) are returned in rank order, then these will be sorted by distance.
  • Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs.
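The filtering option can be sketched as a small helper that builds the query string passed to search.Query. This is a minimal sketch: `build_query_string` is a hypothetical helper, and the `category` field is an assumption (the documents in the question only have `location`). Values are not escaped here, so it assumes they contain no double quotes.

```python
def build_query_string(coords, radius_m, filters=None):
    """Combine the distance restriction with extra field filters.

    coords   -- (latitude, longitude) tuple
    radius_m -- search radius in metres
    filters  -- optional dict of {field_name: value}; field names are
                assumptions about the index schema, not known fields
    """
    clauses = ['distance(location, geopoint(%.3f, %.3f)) < %d'
               % (coords[0], coords[1], radius_m)]
    # Sort the filters so the generated string is deterministic.
    for field, value in sorted((filters or {}).items()):
        clauses.append('%s = "%s"' % (field, value))
    return ' AND '.join(clauses)


print(build_query_string((50.123, 1.123), 1000000, {'category': 'cafe'}))
```

The resulting string can replace the query_string argument in find_document; whether it brings the hit count low enough depends on how selective the extra field is in your data.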

(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added.)
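If a single call is not a hard requirement, another way to keep the hit count small is to start with a narrow radius and widen it only when nothing is found. This is not what the answer above proposes and it costs several calls in sparse regions; it is a sketch only. `find_nearest_expanding` and `search_fn` are hypothetical: `search_fn(coords, radius_m)` is assumed to be a thin wrapper around find_document that returns its hits as a list, nearest first.

```python
HALF_EARTH_CIRCUMFERENCE_M = 20037508  # no two points are farther apart


def find_nearest_expanding(search_fn, coords, start_radius_m=12000,
                           factor=4, max_radius_m=HALF_EARTH_CIRCUMFERENCE_M):
    """Widen the radius until search_fn returns at least one hit.

    search_fn -- hypothetical callable (coords, radius_m) -> list of hits,
                 sorted by distance ascending
    Returns the first hit, or None if nothing is found at max_radius_m.
    """
    radius = start_radius_m
    while True:
        results = search_fn(coords, radius)
        if results:
            return results[0]
        if radius >= max_radius_m:
            return None
        # Grow geometrically so sparse regions need only a few calls.
        radius = min(radius * factor, max_radius_m)
```

The default starting radius of 12 km comes from the question, where that radius gave correct results. One caveat: a widened radius can itself match more documents than the cap allows, so the growth factor is a trade-off between the number of calls and the risk of running into the same truncation.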
