Scrapy throws an error when run using CrawlerProcess
Problem description
I've written a script in Python using Scrapy to collect the names of different posts and their links from a website. When I execute my script from the command line it works flawlessly. Now, my intention is to run the script using CrawlerProcess(). I looked for similar problems in different places, but nowhere could I find a direct solution or anything close to one. However, when I try to run the script as it is, I get the following error:
from stackoverflow.items import StackoverflowItem
ModuleNotFoundError: No module named 'stackoverflow'
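This error means Python cannot locate the stackoverflow package from wherever the script is launched. A quick way to confirm that (a minimal stdlib sketch, independent of Scrapy) is to ask importlib whether the package resolves on the current sys.path:

```python
import importlib.util

# find_spec returns None when the package cannot be located on
# sys.path -- the same condition that raises ModuleNotFoundError
# on "from stackoverflow.items import StackoverflowItem".
spec = importlib.util.find_spec("stackoverflow")
print("importable" if spec is not None else "not importable")
```

If this prints "not importable", the fix is to make the folder containing the package visible on sys.path, which is exactly what the accepted answer below does.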
This is my script so far (stackoverflowspider.py):
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy

class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(stackoverflowspider)
    c.start()
items.py includes:
import scrapy

class StackoverflowItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
Here is the project tree: (linked image showing the directory hierarchy)
I know I can succeed this way, but I am only interested in accomplishing the task the way I tried above:
def parse(self, response):
    for link in response.xpath("//*[@class='question-hyperlink']"):
        name = link.xpath('.//text()').extract_first()
        url = link.xpath('.//@href').extract_first()
        yield {"Name": name, "Link": url}
Accepted answer
Although @Dan-Dev pointed me in the right direction, I decided to provide a complete solution that worked flawlessly for me.
Without changing anything other than what I'm pasting below:
import sys
# The following line (which points to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')

from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy

class stackoverflowspider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        sel = Selector(response)
        items = []
        for link in sel.xpath("//*[@class='question-hyperlink']"):
            item = StackoverflowItem()
            item['name'] = link.xpath('.//text()').extract_first()
            item['url'] = link.xpath('.//@href').extract_first()
            items.append(item)
        return items

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(stackoverflowspider)
    c.start()
Again, including the following in the script is what fixed the problem:
import sys
#The following line (which leads to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
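The hard-coded Windows path works, but it ties the script to one machine. A more portable variant (a sketch; find_project_root is a hypothetical helper, not part of Scrapy) walks upward from the script's own location until it finds the directory containing scrapy.cfg, then puts that directory on sys.path:

```python
import os
import sys

def find_project_root(start, marker="scrapy.cfg"):
    """Walk upward from `start` until a directory containing
    `marker` is found; return None on reaching the filesystem root."""
    path = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(path, marker)):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            return None
        path = parent

# Prepend the project root (the folder holding scrapy.cfg) so that
# "from stackoverflow.items import StackoverflowItem" resolves no
# matter which directory the script is launched from.
root = find_project_root(os.path.dirname(os.path.abspath(__file__)))
if root is not None and root not in sys.path:
    sys.path.insert(0, root)
```

With this in place, the spider script can live anywhere inside the project tree and still be run directly with CrawlerProcess.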