scrapy: Populate nested items with ItemLoader

Question

I have this object that I'm trying to populate with an ItemLoader:

{
  "domains": "string",
  "date_insert": "2016-12-23T11:25:00.213Z",
  "title": "string",
  "url": "string",
  "body": "string",
  "date": "2016-12-23T11:25:00.213Z",
  "authors": [
    "string"
  ],
  "categories": [
    "string"
  ],
  "tags": [
    "string"
  ],
  "stats": {
    "views_count": 0,
    "comments_count": 0
  }
}

Here's my items.py:

import scrapy
# On older Scrapy releases these processors live in scrapy.loader.processors.
from itemloaders.processors import Identity, Join

class StatsItem(scrapy.Item):
    views_count = scrapy.Field()
    comments_count = scrapy.Field()

class ArticleItem(scrapy.Item):
    domain = scrapy.Field()
    date_insert = scrapy.Field()
    date_update = scrapy.Field()
    date = scrapy.Field()  # was declared twice in the original; once is enough
    title = scrapy.Field()
    url = scrapy.Field()
    body = scrapy.Field(
        output_processor=Join())
    authors = scrapy.Field(
        output_processor=Identity())
    categories = scrapy.Field(
        output_processor=Identity())
    tags = scrapy.Field()
    stats = scrapy.Field()

Part of my spider:

def parse(self, response):
    loader = ArticleItemLoader(response=response)
    parsed_uri = urlparse(response.url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    loader.add_css('authors','span.meta-author')
    loader.add_css('title', 'h1.title-article')
    loader.add_value('url', response.url)
    loader.add_xpath('date_insert', '//div[@class=\'meta\']/time[@itemprop=\'datePublished\']/@datetime')
    loader.add_xpath('date_update', '//div[@class=\'meta\']/time[@itemprop=\'dateModified\']/@datetime')
    loader.add_value('domain', domain)
    loader.add_xpath('categories', '//ul[@class=\'breadcrumbs\']//li[not(contains(@class, \'home\'))]')

So far I have successfully populated every field except "stats". I've checked the page "correct way to nest Item data in scrapy", but that approach no longer seems to work. I can't get it to run; the error is TypeError: to_unicode must receive a bytes, str or unicode object, got StatsItem.

I'd like to use the ItemLoader, but I don't see how to populate "stats" with my StatsItem.

Thanks for your help.

Edit: I am close, but it still doesn't work:

loader.add_value('stats', self.getStats(response))

def getStats(self, response):
    statsLoader = StatsItemLoader(response=response)
    statsLoader.add_xpath('comments_count', '//div[@class=\'btn-count\']//a/text()')
    statsLoader.add_value('views_count', '42')
    return json.dumps(dict(statsLoader.load_item()))

But my output looks like this: { [...] "stats": "{\"comments_count\": \"0\", \"views_count\": \"42\"}" }
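
A likely reason for the escaped quotes: json.dumps returns a str, so the stats field ends up holding a string rather than a mapping, and that string is escaped again when the whole item is serialized. A minimal standard-library sketch of the difference, with no Scrapy involved (the sample values are copied from the output above):

```python
import json

stats = {"comments_count": "0", "views_count": "42"}

# Storing the result of json.dumps puts a *string* in the field, so the
# outer serialization escapes its quotes.
double_encoded = json.dumps({"stats": json.dumps(stats)})

# Storing the dict itself yields a real nested JSON object.
nested = json.dumps({"stats": stats})

print(double_encoded)
# {"stats": "{\"comments_count\": \"0\", \"views_count\": \"42\"}"}
print(nested)
# {"stats": {"comments_count": "0", "views_count": "42"}}
```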

Recommended answer

Thanks to @eLRuLL I managed to find a decent solution:

items.py:

class StatsItem(scrapy.Item):
    views_count = scrapy.Field()
    comments_count = scrapy.Field()

class ArticleItem(scrapy.Item):
    [...]
    stats = scrapy.Field(
        input_processor=Identity())


class StatsItemLoader(ItemLoader):
    default_input_processor = MapCompose(remove_tags)
    default_output_processor = TakeFirst()
    default_item_class = StatsItem

spider.py:

def parse(self, response):
    [...]
    loader.add_value('stats', self.getStats(response))
    [...]

def getStats(self, response):
    statsLoader = StatsItemLoader(response=response)
    statsLoader.add_xpath('comments_count', '//div[@class=\'btn-count\']//a/text()')
    statsLoader.add_value('views_count', '42')
    return dict(statsLoader.load_item())

Originally it did not work because the input processor applied to the stats field was MapCompose(remove_tags). To serialize the nested object, you have to return dict(loader.load_item()) rather than just loader.load_item().

Thanks!
