How to update DjangoItem in Scrapy


Problem description

I've been working with Scrapy but have run into a bit of a problem.

DjangoItem has a save method to persist items using the Django ORM. This is great, except that if I run a scraper multiple times, new items will be created in the database even though I may just want to update a previous value.

After looking at the documentation and source code, I don't see any means to update existing items.

I know that I could call out to the ORM to see if an item exists and update it, but it would mean calling out to the database for every single object and then again to save the item.

How can I update items if they already exist?

Solution

Unfortunately, the best way that I found to accomplish this is to do exactly what was stated: Check if the item exists in the database using django_model.objects.get, then update it if it does.

In my settings file, I added the new pipeline:

ITEM_PIPELINES = {
    # ...
    # Last pipeline, because further changes won't be saved.
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999
}
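
Scrapy runs item pipelines in ascending order of the integer assigned to each one, which is why 999 places the persistence pipeline near the end. A minimal sketch of that ordering follows; the two extra pipeline paths are hypothetical names invented for illustration, not part of the original project:

```python
# Only ItemPersistencePipeline comes from the answer; the other two
# entries are hypothetical and exist only to show the ordering.
ITEM_PIPELINES = {
    'apps.scrapy.pipelines.CleanFieldsPipeline': 100,      # hypothetical
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999,
    'apps.scrapy.pipelines.DropDuplicatesPipeline': 300,   # hypothetical
}

# Lower numbers run first, so sorting by value gives the execution order.
run_order = [
    path for path, _priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for path in run_order:
    print(path)
```

Any pipeline given a higher number than the persistence pipeline would see the item only after it had already been written to the database.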

I created some helper functions to turn the item into a model instance, look up an existing record (falling back to the new instance if none exists), and copy the scraped values onto it:

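from django.forms.models import model_to_dict  # used by update_model below


def item_to_model(item):
    # Default to None so a non-DjangoItem raises TypeError, not AttributeError.
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

    return item.instance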


def get_or_create(model):
    model_class = type(model)
    created = False

    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(name=model.name)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()

    return destination
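
To see why `update_model` saves and restores `pk`, here is a small sketch that runs without Django. `FakeModel` and `fake_model_to_dict` are hypothetical stand-ins for a Django model instance and for `model_to_dict`; they are assumptions made for illustration only and mimic just enough behaviour to show the copy step:

```python
class FakeModel:
    """Hypothetical plain-Python stand-in for a Django model instance."""

    def __init__(self, pk=None, **fields):
        self.pk = pk
        for key, value in fields.items():
            setattr(self, key, value)
        self.saved = False

    def save(self):
        # A real model would write to the database here.
        self.saved = True


def fake_model_to_dict(instance):
    # Stand-in for model_to_dict: maps field names to their values.
    return {k: v for k, v in vars(instance).items() if k != "saved"}


def update_model(destination, source, commit=True):
    pk = destination.pk  # remember the key of the existing record
    for key, value in fake_model_to_dict(source).items():
        # In this stand-in the copy also overwrites pk with the source's None.
        setattr(destination, key, value)
    destination.pk = pk  # restore the key so the existing record is updated
    if commit:
        destination.save()
    return destination


existing = FakeModel(pk=7, name="widget", price=10)   # already stored
scraped = FakeModel(pk=None, name="widget", price=12)  # fresh from the spider
updated = update_model(existing, scraped)
print(updated.pk, updated.name, updated.price)
```

Without the restore step, the freshly scraped instance's empty key would replace the stored one, and saving would create a new record instead of updating the old one.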

Then, the final pipeline is fairly straightforward:

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item

        model, created = get_or_create(item_model)

        update_model(model, item_model)

        return item
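
The effect of the pipeline across repeated runs can be sketched without Django or Scrapy. Everything below is a hypothetical in-memory model of the flow (the `Store` class and `persist` function are invented for illustration); it follows the same steps of looking up by name, then either creating or updating:

```python
class Store:
    """Hypothetical in-memory stand-in for a database table."""

    def __init__(self):
        self.rows = {}
        self.next_pk = 1

    def get_by_name(self, name):
        for row in self.rows.values():
            if row["name"] == name:
                return row
        return None

    def save(self, row):
        if row.get("pk") is None:  # no key yet: treat as an insert
            row["pk"] = self.next_pk
            self.next_pk += 1
        self.rows[row["pk"]] = row


def persist(store, scraped):
    """Create the record on first sight, update it on later runs."""
    existing = store.get_by_name(scraped["name"])
    created = existing is None
    if created:
        target = dict(scraped, pk=None)
    else:
        target = existing
        pk = target["pk"]
        target.update(scraped)  # copy the new values over the old ones
        target["pk"] = pk       # keep the existing key
    store.save(target)
    return created


store = Store()
first = persist(store, {"name": "widget", "price": 10})   # first scraper run
second = persist(store, {"name": "widget", "price": 12})  # second scraper run
print(first, second, len(store.rows))
```

The second run changes the stored price but leaves a single record, which is the behaviour the question asks for. The cost noted in the question still applies: each item needs one lookup and one save.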
