How to filter datastore data before mapping to cloud storage using the MapReduce API?


Problem description

Regarding the code lab here, how can we filter datastore data within the mapreduce jobs rather than fetching all objects for a certain entity kind?

In the mapper pipeline definition below, the only input reader parameter is the entity kind to process, and I can't see any other filter-type parameter in the InputReader class that could help.

output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=100)
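For context, the "main.datastore_map" string above names the map handler, which is called once per entity and yields output lines for FileOutputWriter to write. A minimal sketch of such a handler (the property name is hypothetical, and the exact shape of the entity object depends on the input reader used):

    import json

    def datastore_map(entity):
        # With DatastoreInputReader, each call receives one entity of the
        # configured kind. "datastore_property" is a hypothetical field;
        # substitute the properties of your own entity kind.
        record = {"datastore_property": getattr(entity, "datastore_property", None)}
        # FileOutputWriter writes each yielded string, so emit one
        # JSON record per line.
        yield json.dumps(record) + "\n"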

Since Google BigQuery plays better with a denormalized data model, it would be nice to be able to build one table from several datastore entity kinds (JOINs), but I can't see how to do that either.

Solution

Depending on your application, you might be able to solve this by passing a "filters" parameter, which is "an optional list of filters to apply to the query. Each filter is a tuple: (<property_name_as_str>, <query_operation_as_str>, <value>)."

So, in your input reader parameters:

"input_reader":{
          "entity_kind": entity_type,
          "filters": [("datastore_property", "=", 12345),
                      ("another_datastore_property", ">", 200)]
}
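Putting it together, the pipeline definition from the question only needs the extra "filters" entry in its input reader params (the property names and values below are placeholders):

    output = yield mapreduce_pipeline.MapperPipeline(
        "Datastore Mapper %s" % entity_type,
        "main.datastore_map",
        "mapreduce.input_readers.DatastoreInputReader",
        output_writer_spec="mapreduce.output_writers.FileOutputWriter",
        params={
            "input_reader": {
                "entity_kind": entity_type,
                # Only entities matching every filter tuple are mapped.
                "filters": [("datastore_property", "=", 12345),
                            ("another_datastore_property", ">", 200)],
            },
            "output_writer": {
                "filesystem": "gs",
                "gs_bucket_name": GS_BUCKET,
                "output_sharding": "none",
            },
        },
        shards=100)

As with ordinary datastore queries, filters can only be applied to indexed properties.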

