Tag huge list of elements with lat/long with large list of geolocation data


Question

I have a huge list of geolocation events:

Event (1 billion)
------
id
datetime
lat
long

And a list of points of interest (POI) loaded from OpenStreetMap:

POI (1 million)
------
id
tag   (shop, restaurant, etc.)
lat
long

I would like to assign to each event the tag of the point of interest. What is the best architecture for this? We tried Google BigQuery, but it requires a cross join, and that does not work. We are open to using any other big data system.

Recommended answer

Using Dataflow you can do a cross join pretty easily with CoGroupByKey. With this approach, only the Events and POIs being joined for a given key need to fit in memory (Dataflow automatically spills to disk if the list of items for a given key is too large to fit in memory).
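To make the memory claim concrete, here is a minimal plain-Java sketch of what grouping two keyed inputs by a shared key does. It is not Dataflow code: `CoGroupSketch`, `Grouped`, `coGroup` and `tagEvents` are hypothetical names, and events and POIs are reduced to `{key, value}` string pairs for illustration. The point it shows is that pairing happens only inside each key's group, so only one group's lists are needed at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, in-memory illustration of a co-group join; not the Dataflow API.
final class CoGroupSketch {
  /** Values from both inputs that share one key (a stand-in for a per-key join result). */
  static final class Grouped {
    final List<String> events = new ArrayList<>();
    final List<String> pois = new ArrayList<>();
  }

  /** Groups two inputs of {key, value} pairs by key. */
  static Map<String, Grouped> coGroup(List<String[]> keyedEvents, List<String[]> keyedPois) {
    Map<String, Grouped> out = new HashMap<>();
    for (String[] kv : keyedEvents) {
      out.computeIfAbsent(kv[0], k -> new Grouped()).events.add(kv[1]);
    }
    for (String[] kv : keyedPois) {
      out.computeIfAbsent(kv[0], k -> new Grouped()).pois.add(kv[1]);
    }
    return out;
  }

  /** Pairs every event with every POI inside the same group only. */
  static List<String> tagEvents(Map<String, Grouped> grouped) {
    List<String> tagged = new ArrayList<>();
    for (Grouped g : grouped.values()) {
      for (String event : g.events) {
        for (String poi : g.pois) {
          tagged.add(event + "->" + poi);
        }
      }
    }
    return tagged;
  }

  public static void main(String[] args) {
    List<String[]> events = new ArrayList<>();
    events.add(new String[] {"cellA", "event1"});
    events.add(new String[] {"cellB", "event2"});
    List<String[]> pois = new ArrayList<>();
    pois.add(new String[] {"cellA", "shop"});
    System.out.println(tagEvents(coGroup(events, pois)));
  }
}
```

Keys that appear in only one input produce no output, which is why the choice of key decides which events can be tagged at all.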

Here are some details.

  • Create a PCollection of events keyed by latitude and longitude.
  • Create a PCollection of POIs keyed by latitude and longitude.
  • Use a CoGroupByKey to join the two PCollections.
  • Write a DoFn that processes the CoGbkResult (https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/join/CoGbkResult).
  • The DoFn would look something like:


PCollection<T> finalResultCollection =
  coGbkResultCollection.apply(ParDo.of(
    new DoFn<KV<K, CoGbkResult>, T>() {
      @Override
      public void processElement(ProcessContext c) {
        KV<K, CoGbkResult> kv = c.element();
        // All events that share this key
        Iterable<Event> eventVals = kv.getValue().getAll(eventTag);
        // All POIs that share this key
        Iterable<Poi> poiVals = kv.getValue().getAll(poiTag);
        // Pair each event with each POI under the same key
        for (Event event : eventVals) {
          for (Poi poi : poiVals) {
            ...
            c.output(...tagged event...);
          }
        }
      }
    }));
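The answer does not say how to build the key from latitude and longitude. Raw coordinates of an event and a POI will rarely be exactly equal, so one possible approach (an assumption here, not something the answer specifies) is to snap both to the same grid cell and use the cell as the key. `GridKey`, `cellKey` and the `cellDegrees` parameter are hypothetical names for illustration.

```java
// Hypothetical helper: derives a join key by snapping coordinates to a grid cell.
final class GridKey {
  /**
   * Returns "row:col" for the cell containing (lat, lon), where each cell is
   * cellDegrees wide and tall. Math.floor keeps negative coordinates in the
   * correct cell instead of truncating toward zero.
   */
  static String cellKey(double lat, double lon, double cellDegrees) {
    long row = (long) Math.floor(lat / cellDegrees);
    long col = (long) Math.floor(lon / cellDegrees);
    return row + ":" + col;
  }

  public static void main(String[] args) {
    // Two nearby points and one farther away, using 0.01-degree cells.
    System.out.println(cellKey(48.8584, 2.2945, 0.01));
    System.out.println(cellKey(48.8589, 2.2950, 0.01));
    System.out.println(cellKey(48.8684, 2.2945, 0.01));
  }
}
```

A fixed grid has a known weakness: an event just across a cell boundary from a POI gets a different key. If that matters, one option is to emit each POI under its neighboring cells as well and filter by distance inside the DoFn.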

As discussed in this answer (http://stackoverflow.com/questions/33254689/best-strategy-for-joining-two-large-datasets), you could also use a side input to pass a map whose keys are latitude and longitude and whose values are the details of a POI. That approach works if the data fits in memory. If you only have 1 million POIs and you store only the fields listed, it will probably fit.
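The lookup side of that idea can be sketched in plain Java, with an ordinary `Map` standing in for the side input. This is not Dataflow code; `SideInputSketch` and `tagEvents` are hypothetical names, and the string keys are assumed to be whatever latitude/longitude key both datasets share.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a map-side join; the Map stands in for a side input.
final class SideInputSketch {
  /**
   * Looks up each {key, eventId} pair in the in-memory POI map and returns
   * "eventId->tag" for every event whose key is present. Events with no
   * matching key are skipped.
   */
  static List<String> tagEvents(List<String[]> keyedEvents, Map<String, String> poiTagByKey) {
    List<String> tagged = new ArrayList<>();
    for (String[] kv : keyedEvents) {
      String tag = poiTagByKey.get(kv[0]);
      if (tag != null) {
        tagged.add(kv[1] + "->" + tag);
      }
    }
    return tagged;
  }

  public static void main(String[] args) {
    Map<String, String> poiTagByKey = new HashMap<>();
    poiTagByKey.put("cellA", "shop");
    List<String[]> events = new ArrayList<>();
    events.add(new String[] {"cellA", "event1"});
    events.add(new String[] {"cellB", "event2"});
    System.out.println(tagEvents(events, poiTagByKey));
  }
}
```

Each event is handled with a single hash lookup and no shuffle of the event data, which is the appeal when the POI side is small. A plain map holds one POI per key, so several POIs sharing a key would need a list as the value. As a rough size check, assuming about 100 bytes per entry including object overhead (an assumed figure, not a measurement), 1 million entries come to roughly 100 MB.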

Note: I'm on the Dataflow team.
