How to sort the scrapy item info in customized order?


Problem description

The default field order in Scrapy output is alphabetical. I have read some posts about using OrderedDict to output items in a customized order, and I wrote a spider following this page:
How to get order of fields in Scrapy item

My items.py:

import scrapy
import six
from collections import OrderedDict


class OrderedItem(scrapy.Item):
    """A scrapy.Item whose values live in an OrderedDict, so fields
    keep the order in which they are assigned."""

    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

class InfoItem(OrderedItem):  # class name must match the import in the spider
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()
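
As a quick sanity check (with hypothetical values), the item should keep the order in which its fields are assigned:

item = InfoItem()
item["phone"] = "0571-87210223"
item["name"] = "浙能电力"
print(list(item.keys()))
# ['phone', 'name'] -- keys come out in assignment order, not alphabetical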

A simple spider file:

import scrapy
from info.items import InfoItem


class InfoSpider(scrapy.Spider):
    name = 'info'  # must match the name used in "scrapy crawl info"
    allowed_domains = ['quotes.money.163.com']
    start_urls = ["http://quotes.money.163.com/f10/gszl_600023.html"]

    def parse(self, response):
        item = InfoItem()
        item["name"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[2]/text()').extract()
        item["phone"] = response.xpath('/html/body/div[2]/div[4]/table/tr[7]/td[4]/text()').extract()
        item["address"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[4]/text()').extract()
        yield item

The Scrapy log output when running the spider:

2019-04-25 13:45:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],'name': ['浙能电力'],'phone': ['0571-87210223']}

Why can't I get the desired order shown below?

{'name': ['浙能电力'],'phone': ['0571-87210223'],'address': ['浙江省杭州市天目山路152号浙能大厦']}

Thanks to Gallaecio's advice, I added the following to settings.py:

FEED_EXPORT_FIELDS=['name','phone','address']
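
As a side note, the same setting can also live on the spider itself via custom_settings (a standard Scrapy spider attribute), if you prefer a per-spider rather than project-wide configuration; a minimal sketch, reusing the field names from items.py above:

class InfoSpider(scrapy.Spider):
    name = 'info'
    # Per-spider override of the project-wide setting.
    custom_settings = {
        'FEED_EXPORT_FIELDS': ['name', 'phone', 'address'],
    }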

Execute the spider and output to a CSV file:

scrapy crawl info -o info.csv

The field order in the CSV is my customized order:

cat info.csv
name,phone,address
浙能电力,0571-87210223,浙江省杭州市天目山路152号浙能大厦

But look at Scrapy's debug info:

2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],
 'name': ['浙能电力'],
 'phone': ['0571-87210223']}

How can I make the debug info appear in my customized order? How do I get the following debug output?

2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'name': ['浙能电力'],
 'phone': ['0571-87210223'],
 'address': ['浙江省杭州市天目山路152号浙能大厦'],}

Answer

The problem is in the __repr__ function of Item. Originally its code is:

def __repr__(self):
    return pformat(dict(self))

So even if you convert your item to an OrderedDict and expect the fields to be saved in the same order, this function applies dict() to it (and pformat then sorts the keys alphabetically), which breaks the order.
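
A quick standalone illustration: pformat sorts dict keys alphabetically by default, regardless of insertion order.

from collections import OrderedDict
from pprint import pformat

d = OrderedDict([('name', 1), ('phone', 2), ('address', 3)])
print(pformat(dict(d)))
# {'address': 3, 'name': 1, 'phone': 2}  <- sorted; insertion order is gone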

So I propose you overload it in the way you like, for example:

import json
from collections import OrderedDict

import scrapy
import six


class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

    def __repr__(self):
        # __repr__ must return a string; serializing through OrderedDict
        # keeps the insertion order in the log output.
        return json.dumps(OrderedDict(self), ensure_ascii=False)

And now you can get this output:

2019-04-30 18:56:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{"name": ["\u6d59\u80fd\u7535\u529b"], "phone": ["0571-87210223"], "address": ["\u6d59\u6c5f\u7701\u676d\u5dde\u5e02\u5929\u76ee\u5c71\u8def152\u53f7\u6d59\u80fd\u5927\u53a6"]}

