Portia/Scrapy - how to replace or add values to output JSON


Question

Just two quick questions:

1- I want to replace an extracted text value in my final JSON file. For example, the extracted text is ADD TO CART, but I want it to appear as IN STOCK in the final JSON. Is that possible?

2- I would also like to add custom data that is not on the website to my final JSON file, for example a "Store name" value, so that every product I scrape carries the store name. Is that possible?

I am using both Portia and Scrapy, so suggestions for either platform are welcome.

My Scrapy spider code is below:

from __future__ import absolute_import  # must precede all other imports
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule
from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
from ..utils.processors import (Item, Field, Text, Number, Price, Date, Url,
                                Image, Regex)
from ..items import PortiaItem


class Advent(BasePortiaSpider):
    name = "advent"
    allowed_domains = [u'www.adventgames.com.au']
    start_urls = [u'http://www.adventgames.com.au/c/4504822/1/all-games-a---k.html',
                  {u'url': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page=[1-5]',
                   u'fragments': [{u'valid': True,
                                   u'type': u'fixed',
                                   u'value': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page='},
                                  {u'valid': True,
                                   u'type': u'range',
                                   u'value': u'1-5'}],
                   u'type': u'generated'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=('.*'),
                deny=()
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                PortiaItem,
                None,
                u'.DataViewCell > form > table',
                [
                    Field(
                        u'Title',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text',
                        []),
                    Field(
                        u'Price',
                        'tr:nth-child(1) > td > .DataViewItemOurPrice *::text',
                        []),
                    Field(
                        u'Img_src',
                        'tr:nth-child(1) > td > .DataViewItemThumbnailImage > div > a > img::attr(src)',
                        []),
                    Field(
                        u'URL',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a::attr(href)',
                        []),
                    Field(
                        u'Stock',
                        'tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)',
                        [])])]]

Recommended answer

I have never used the items class variable; it looks very unreadable and difficult to understand.

I would suggest writing a callback method and parsing the response like this:

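def my_callback_func(self, response):
    # Each product sits in its own table inside a .DataViewCell form.
    for row in response.css(".DataViewCell > form > table"):
        # Build a fresh item for every product.
        item = PortiaItem()

        item['Title'] = row.css('tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text').extract_first()

        item['Stock'] = row.css('tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)').extract_first()

        # Derive a custom value from the scraped one
        # (assumes the item class accepts an 'is_available' field).
        if item['Stock'] == "ADD TO CART":
            item['is_available'] = "YES"

        # ... fill in the remaining fields the same way

        yield item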
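
Another way to cover both questions without touching the spider is a Scrapy item pipeline, which sees every item before it is written to the JSON feed. The following is only a minimal sketch under some assumptions: the class name StockAndStorePipeline, the store_name key and the "Advent Games" value are made up for illustration, and it assumes the Stock field arrives as a single string. If your item class declares its fields explicitly, store_name would need to be declared there as well.

```python
class StockAndStorePipeline:
    """Rewrite the scraped stock label and attach a constant store name."""

    # Assumed mapping from the scraped button text to the label wanted in the JSON.
    STOCK_LABELS = {"ADD TO CART": "IN STOCK"}
    # Hypothetical constant; this value does not come from the scraped site.
    STORE_NAME = "Advent Games"

    def process_item(self, item, spider):
        # Scrapy calls this method once for every item the spider yields.
        stock = item.get("Stock")
        if isinstance(stock, str):
            # Normalise whitespace and case before looking up the replacement;
            # unknown labels are left unchanged.
            key = stock.strip().upper()
            item["Stock"] = self.STOCK_LABELS.get(key, stock)
        # Add the custom value that is not present on the website.
        item["store_name"] = self.STORE_NAME
        return item
```

To use it, the class would go in the project's pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, for example {"myproject.pipelines.StockAndStorePipeline": 300}, where myproject stands in for the real project package name.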
