Scrapy Pipeline - unhashable type list


Question

I am trying to create a spider that fetches all the URLs from one domain and creates a record of the domain name together with all the headers across the URLs on that domain. This is a continuation of a previous question.

I managed to get help, and I understand that I need to use an item pipeline in the Scrapy framework to achieve this. I create a dict in the item pipeline where I store the domain name as the key and append all the headers to it.

The error I receive is: unhashable type: 'list'

spider.py

# Imports assumed from the pre-1.0 Scrapy API this code targets.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader

from Prospecting.items import WebsiteItem


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = ['web.aitp.se']
    start_urls = ['http://web.aitp.se/']

    rules = (
        # Follow every link on the site and parse each page with parse_item.
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        # The domain is the host part of the URL, e.g. 'web.aitp.se'.
        domain = response.url.split("/")[2]
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_value('domain', domain)
        loader.add_xpath('h1', "//h1/text()")
        yield loader.load_item()
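For context (an assumption based on XPathItemLoader's default Identity output processor): the loader collects every field as a list, so the item that reaches the pipeline looks roughly like this, with illustrative values:

item = WebsiteItem()
item['domain'] = ['web.aitp.se']                     # a list, not a plain string
item['h1'] = [u'First heading', u'Second heading']   # every matched <h1> text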

pipelines.py

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.http import Request
from Prospecting.items import WebsiteItem
from collections import defaultdict

class DomainPipeline(object):
    global Accumulator
    Accumulator = defaultdict(list)

    def process_item(self, item, spider):
        Accumulator[ item['domain'] ].append( item['h1'] )

    def close_spider(spider):
        yield Accumulator.items()

I tried to break down the problem, and just read domains and headers from a csv file and merge them into one record, and this works fine.

from collections import defaultdict

Accumulator = defaultdict(list)
companies = open('test.csv', 'r')

for line in companies:
    fields = line.split(',')
    Accumulator[ fields[0] ].append(fields[1])

print Accumulator.items()

Answer

In Python, a list cannot be used as a key in a dict. Dict keys need to be hashable, which usually means that keys need to be immutable.
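A quick demonstration of the rule:

d = {}
d[('a', 'b')] = 1   # fine: a tuple is immutable and hashable
d[['a', 'b']] = 1   # raises TypeError: unhashable type: 'list'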

So, if there is any place where you are using lists, you can convert them into tuples before adding them to a dict. tuple(mylist) should be good enough to convert the list to a tuple.
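Applied to the pipeline above, a minimal sketch of the fix (assuming, as sketched under spider.py, that the loader wraps each field value in a list): unwrap the domain before using it as the key, or convert it with tuple() if it can hold several values:

def process_item(self, item, spider):
    # item['domain'] is a list such as ['web.aitp.se']; a list cannot
    # be a dict key, so unwrap it (or use tuple(item['domain'])).
    domain = item['domain'][0]
    Accumulator[domain].extend(item['h1'])
    # Scrapy expects process_item to return the item for later pipelines.
    return item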
