Scrapy throws an error when run using CrawlerProcess

Problem description

I've written a script in Python using Scrapy to collect the names of different posts and their links from a website. When I execute my script from the command line it works flawlessly. Now, my intention is to run the script using CrawlerProcess(). I've looked for similar problems in different places, but nowhere could I find a direct solution or anything close to it. However, when I try to run the script as is, I get the following error:

from stackoverflow.items import StackoverflowItem
ModuleNotFoundError: No module named 'stackoverflow'

This is my script so far (stackoverflowspider.py):

from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy

class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self,response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(stackoverflowspider)
    c.start()

items.py includes:

import scrapy

class StackoverflowItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

This is the tree (the original post links to a screenshot of the project hierarchy; a reconstruction follows below).
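
For reference, a standard project generated by scrapy startproject stackoverflow looks roughly like this (the exact file set can vary between Scrapy versions); this layout is what makes from stackoverflow.items import StackoverflowItem a valid import:

stackoverflow/              <- project root, the folder containing scrapy.cfg
    scrapy.cfg
    stackoverflow/          <- the "stackoverflow" package the import refers to
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            stackoverflowspider.py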

I know I can succeed the following way, but I am only interested in accomplishing the task the way I tried above:

def parse(self,response):
    # sel was missing from the original excerpt; it must be defined before use
    sel = Selector(response)
    for link in sel.xpath("//*[@class='question-hyperlink']"):
        name = link.xpath('.//text()').extract_first()
        url = link.xpath('.//@href').extract_first()
        yield {"Name":name,"Link":url}
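
As a side note, recent Scrapy versions let you drop the explicit Selector entirely; this is a minimal sketch of the same logic using response.xpath() and .get(), the newer alias for .extract_first():

def parse(self, response):
    # response.xpath() works directly on the response object, and .get()
    # returns the first match (None when nothing matches).
    for link in response.xpath("//*[@class='question-hyperlink']"):
        yield {"Name": link.xpath('.//text()').get(),
               "Link": link.xpath('.//@href').get()}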

Recommended answer

Although @Dan-Dev pointed me in the right direction, I decided to provide a complete solution that worked flawlessly for me.

Without changing anything other than what I'm pasting below:

import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy


class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self,response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(stackoverflowspider)
    c.start()

Again, what fixed the problem was including the following in the script. The ModuleNotFoundError occurred because, when the script is run directly, the folder containing the stackoverflow package (the one holding scrapy.cfg) is not on Python's import path:

import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
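
As a variation, if you'd rather not hard-code an absolute path, a minimal sketch like the following derives the project root from the script's own location. It assumes the script lives in the inner stackoverflow/spiders/ folder, two levels below the folder containing scrapy.cfg; adjust parents[...] if your file sits elsewhere:

import sys
from pathlib import Path

# Assumption: this file is <project root>/stackoverflow/spiders/stackoverflowspider.py,
# so the folder holding "scrapy.cfg" is two parents up; tweak the index otherwise.
PROJECT_ROOT = Path(__file__).resolve().parents[2]
sys.path.append(str(PROJECT_ROOT))

from stackoverflow.items import StackoverflowItem  # now resolvable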
