Scrapy Media Pipeline, files not downloading
Question
I am new to Scrapy. I am trying to download files using the media pipeline, but when I run the spider, no files are stored in the folder.
Spider:
import scrapy
from scrapy import Request
from pagalworld.items import PagalworldItem


class JobsSpider(scrapy.Spider):
    name = "songs"
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']

    def parse(self, response):
        urls = response.xpath('//div[@class="pageLinkList"]/ul/li/a/@href').extract()
        for link in urls:
            yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        songName = response.xpath('//li/b/a/@href').extract()
        for song in songName:
            yield Request(song, callback=self.parsing_link)

    def parsing_link(self, response):
        item = PagalworldItem()
        item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
        yield {"download_link": item['file_urls']}
Items file:
import scrapy
class PagalworldItem(scrapy.Item):
    file_urls = scrapy.Field()
Settings file:
BOT_NAME = 'pagalworld'
SPIDER_MODULES = ['pagalworld.spiders']
NEWSPIDER_MODULE = 'pagalworld.spiders'
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 5
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1
}
FILES_STORE = '/tmp/media/'
The output looks like this:
Recommended answer
def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield {"download_link": item['file_urls']}
You are yielding:
yield {"download_link": ['http://someurl.com']}
For Scrapy's media/files pipeline to work, you need to yield an item that contains a file_urls field. The dict above only has a download_link key, so the pipeline finds nothing to download. Try this instead:
def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield item
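To make the difference concrete, here is a rough, simplified model of how a files pipeline picks URLs out of an item. It is only an illustration, not Scrapy's actual implementation: the helper name urls_for_pipeline is hypothetical, and the assumption is simply that the pipeline reads the file_urls key and treats a missing key as "nothing to download".

```python
# Simplified model of the lookup; NOT Scrapy's real code.
# `urls_for_pipeline` is a hypothetical helper used only for illustration.
def urls_for_pipeline(item, urls_field="file_urls"):
    """Return the URLs a files pipeline would try to download from an item."""
    # A missing key is treated as an empty list, so nothing is downloaded.
    return list(item.get(urls_field, []))


# What the original spider yields: the URLs sit under the wrong key.
wrong_item = {"download_link": ["http://someurl.com"]}

# What the corrected spider yields: the URLs sit under file_urls.
right_item = {"file_urls": ["http://someurl.com"]}

print(urls_for_pipeline(wrong_item))
print(urls_for_pipeline(right_item))
```

With the first item the helper finds no URLs, which matches the symptom in the question: the spider runs without errors, yet nothing appears in FILES_STORE.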