ndb Models are not saved in memcache when using MapReduce


Problem Description

I've created two MapReduce Pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.
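
For reference, here is a minimal sketch of what the two models might look like, based only on the fields used in the mapper below; any field not used there (such as a category name) is an assumption:

from google.appengine.ext import ndb

class Category(ndb.Model):
    # illustrative field; the question only says Category is an ndb.Model
    name = ndb.StringProperty()

class Product(ndb.Model):
    # KeyProperty ties each product to its category, as described above
    category = ndb.KeyProperty(kind=Category)
    product_type = ndb.StringProperty()
    description = ndb.StringProperty()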

I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.

However, it doesn't seem like the Product upload is using Memcache to get the Categories. When I check the Memcache viewer in the portal, it says something along the lines of the hit count being around 180 and the miss count around 60. If I were uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).

Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:

import csv

# Category, Product, and product_import_error_messages are defined elsewhere
# in the application.

def product_bulk_import_map(data):
    """Product Bulk Import map function."""

    result = {"status" : "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id = p_id,
                category = category.key,
                product_type = p_type,
                description = p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)

Recommended Answer

MapReduce intentionally disables memcache for NDB.

See mapreduce/util.py, line 373, _set_ndb_cache_policy() (as of 2015-05-01):

def _set_ndb_cache_policy():
  """Tell NDB to never cache anything in memcache or in-process.

  This ensures that entities fetched from Datastore input_readers via NDB
  will not bloat up the request memory size and Datastore Puts will avoid
  doing calls to memcache. Without this you get soft memory limit exits,
  which hurts overall throughput.
  """
  ndb_ctx = ndb.get_context()
  ndb_ctx.set_cache_policy(lambda key: False)
  ndb_ctx.set_memcache_policy(lambda key: False)

You can force get_by_id() and put() to use memcache, e.g.:

product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
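
Applied to the mapper in the question, that would mean passing the flag on the two lookups and the final put; a sketch under that assumption (the surrounding parsing and error handling stay as written above):

# the same lookups from product_bulk_import_map, with memcache re-enabled per call
category = Category.get_by_id(c_id, use_memcache=True)
product = Product.get_by_id(p_id, use_memcache=True)
if product is None:
    product = Product(id=p_id, category=category.key,
                      product_type=p_type, description=p_description)
product.put(use_memcache=True)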

Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However, I don't know enough to say whether this has other undesired effects:

ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)

As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache were enabled (i.e., no in-context cache).

It actually seems like you want memcache to be enabled for puts, otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
