How to filter datastore data before mapping to cloud storage using the MapReduce API?
Question
Regarding the code lab here, how can we filter datastore data within the mapreduce jobs rather than fetching all objects of a certain entity kind?
In the mapper pipeline definition below, the only input reader parameter is the entity kind to process, and I can't see any other filter-like parameter in the InputReader class that could help.
output = yield mapreduce_pipeline.MapperPipeline(
    "Datastore Mapper %s" % entity_type,
    "main.datastore_map",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": entity_type,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": GS_BUCKET,
            "output_sharding": "none",
        }
    },
    shards=100)
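For context, the `main.datastore_map` function referenced above is not shown in the snippet; a minimal sketch of what such a mapper might look like is below, assuming (hypothetically) that each entity is serialized to one JSON line for the `FileOutputWriter` to write to Cloud Storage:

```python
import json


def entity_to_json_line(entity_dict):
    """Serialize one entity's properties (a plain dict) to a JSON line.

    sort_keys gives a stable property order across shards.
    """
    return json.dumps(entity_dict, sort_keys=True) + "\n"


def datastore_map(entity):
    # Hypothetical mapper body: the input reader passes each entity in,
    # and every yielded string becomes one line in the output file.
    # entity.to_dict() assumes an ndb/db model instance.
    yield entity_to_json_line(entity.to_dict())
```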
Since Google BigQuery plays better with a denormalized data model, it would be nice to be able to build one table from several datastore entity kinds (JOINs), but I can't see how to do that either.
Solution
Depending on your application, you might be able to solve this by passing a filter parameter, which is "an optional list of filters to apply to the query. Each filter is a tuple: (<property_name_as_str>, <query_operation_as_str>, <value>)."
So, in your input reader parameters:
"input_reader": {
    "entity_kind": entity_type,
    "filters": [("datastore_property", "=", 12345),
                ("another_datastore_property", ">", 200)]
}
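Putting it together, the "filters" entry slots straight into the params dict from the question. A sketch of the full parameter structure follows; the property names, values, and entity kind are placeholders, not from a real schema:

```python
# Sketch of the question's params dict with the optional "filters" list added.
GS_BUCKET = "my-bucket"          # placeholder bucket name
entity_kind = "main.MyEntity"    # placeholder entity kind

params = {
    "input_reader": {
        "entity_kind": entity_kind,
        # Each filter is a tuple: (property_name, operation, value);
        # all filters are ANDed together by the input reader's query.
        "filters": [("datastore_property", "=", 12345),
                    ("another_datastore_property", ">", 200)],
    },
    "output_writer": {
        "filesystem": "gs",
        "gs_bucket_name": GS_BUCKET,
        "output_sharding": "none",
    },
}
```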