ndb Models are not saved in memcache when using MapReduce


Problem description


I've created two MapReduce pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.

I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.

However, it doesn't seem like the Product upload is using Memcache to get the Categories. When I check the Memcache viewer in the portal, the hit count is around 180 and the miss count around 60. If I'm uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).

Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:

def product_bulk_import_map(data):
    """Product Bulk Import map function."""

    result = {"status" : "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id = p_id,
                category = category.key,
                product_type = p_type,
                description = p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)

Solution

MapReduce intentionally disables memcache for NDB.

See mapreduce/util.py ln 373, _set_ndb_cache_policy() (as of 2015-05-01):

def _set_ndb_cache_policy():
  """Tell NDB to never cache anything in memcache or in-process.

  This ensures that entities fetched from Datastore input_readers via NDB
  will not bloat up the request memory size and Datastore Puts will avoid
  doing calls to memcache. Without this you get soft memory limit exits,
  which hurts overall throughput.
  """
  ndb_ctx = ndb.get_context()
  ndb_ctx.set_cache_policy(lambda key: False)
  ndb_ctx.set_memcache_policy(lambda key: False)
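
You can confirm this from inside a mapper by inspecting the context's policy directly (a quick diagnostic sketch, not part of the original answer; it assumes get_memcache_policy(), the getter paired with set_memcache_policy(), and a hypothetical Category id):

import logging

from google.appengine.ext import ndb

ctx = ndb.get_context()
policy = ctx.get_memcache_policy()

# Under MapReduce this is the `lambda key: False` installed above, so the
# policy rejects every key -- consistent with the near-zero hit counts.
sample_key = ndb.Key("Category", "some-category-id")  # hypothetical key
logging.info("memcache allowed for %s: %s", sample_key, policy(sample_key))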

You can force get_by_id() and put() to use memcache, e.g.:

product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
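
use_memcache is one of NDB's standard per-call context options, and use_cache (the in-process cache) can be passed the same way; keeping it off preserves the memory protection the MapReduce policy was aiming at. A minimal sketch, assuming the Category and Product models from the question:

# Sketch: re-enable memcache per call while keeping the in-process
# cache off, matching MapReduce's memory rationale.
category = Category.get_by_id(c_id, use_memcache=True, use_cache=False)
product = Product.get_by_id(p_id, use_memcache=True, use_cache=False)
product.put(use_memcache=True, use_cache=False)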

Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However, I don't know enough to say whether this has other undesired effects:

ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)

As for the docstring's mention of "soft memory limit exits", I don't understand why those would occur if only memcache were enabled (i.e., no in-context cache).

It actually seems like you want memcache to be enabled for puts; otherwise, your app ends up reading stale data from NDB's memcache after your mapper has modified the underlying data.
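
A middle ground (my own sketch, not from the original answer): NDB policy functions are called with the entity's Key, so the policy can be gated on key.kind(). That caches the 30 Categories that every one of the 3000 map calls reads, while Products keep MapReduce's default; since Products are then never memcached, the stale-read concern above doesn't apply to them:

ndb_ctx = ndb.get_context()
# Cache Category entities only; Product gets and puts bypass memcache,
# so no stale Product entry can be read back later.
ndb_ctx.set_memcache_policy(lambda key: key.kind() == "Category")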
