Scrapy Pipeline - unhashable type list


Question

I am trying to create a spider that fetches all the URLs from one domain and creates a record of the domain name together with all the headers across the URLs on that domain. This is a continuation of a previous question.

I managed to get help, and I understand that I need to use an item pipeline in the Scrapy framework to achieve this. I create a dict in the item pipeline where I store the domain name as the key and append all the headers to it.

The error I receive is: unhashable type: 'list'

spider.py

# Imports assumed from the pre-1.0 Scrapy API this code targets.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader

from Prospecting.items import WebsiteItem


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = ['web.aitp.se']
    start_urls = ['http://web.aitp.se/']

    rules = (
        # Follow every link on the site and parse each page with parse_item.
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        # The domain is the host part of the URL, e.g. 'web.aitp.se'.
        domain = response.url.split("/")[2]
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_value('domain', domain)
        loader.add_xpath('h1', "//h1/text()")
        yield loader.load_item()
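For context (an assumption based on XPathItemLoader's default Identity output processor): the loader collects every field as a list, so the item that reaches the pipeline looks roughly like this, with illustrative values:

item = WebsiteItem()
item['domain'] = ['web.aitp.se']                     # a list, not a plain string
item['h1'] = [u'First heading', u'Second heading']   # every matched <h1> text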

pipelines.py

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.http import Request
from Prospecting.items import WebsiteItem
from collections import defaultdict

class DomainPipeline(object):
    global Accumulator
    Accumulator = defaultdict(list)

    def process_item(self, item, spider):
        Accumulator[ item['domain'] ].append( item['h1'] )

    def close_spider(spider):
        yield Accumulator.items()

I tried to break down the problem, and just read domains and headers from a csv file and merge them into one record, and this works fine.

from collections import defaultdict

Accumulator = defaultdict(list)
companies = open('test.csv', 'r')

for line in companies:
    fields = line.split(',')
    Accumulator[ fields[0] ].append(fields[1])

print Accumulator.items()

Answer

In Python, a list cannot be used as a key in a dict. Dict keys need to be hashable, which usually means that keys need to be immutable.
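A quick demonstration of the rule:

d = {}
d[('a', 'b')] = 1   # fine: a tuple is immutable and hashable
d[['a', 'b']] = 1   # raises TypeError: unhashable type: 'list'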

So, if there is any place where you are using lists, you can convert them into tuples before adding them to a dict. tuple(mylist) should be good enough to convert the list to a tuple.
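Applied to the pipeline above, a minimal sketch of the fix (assuming, as sketched under spider.py, that the loader wraps each field value in a list): unwrap the domain before using it as the key, or convert it with tuple() if it can hold several values:

def process_item(self, item, spider):
    # item['domain'] is a list such as ['web.aitp.se']; a list cannot
    # be a dict key, so unwrap it (or use tuple(item['domain'])).
    domain = item['domain'][0]
    Accumulator[domain].extend(item['h1'])
    # Scrapy expects process_item to return the item for later pipelines.
    return item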
