Pass values into scrapy callback


Problem description

I'm trying to get started crawling a website and scraping it to disk, but I'm having trouble getting the callback function to work the way I would like.

The code below visits the start_url and finds all the "a" tags on the site. For each of them it makes a callback that saves the text response to disk and uses the crawlerItem to store some metadata about the page.

I was hoping someone could help me figure out how to:

  1. pass a unique ID into each callback, so it can be used as the file name when saving the file
  2. pass the url of the originating page, so it can be added to the metadata via the Items
  3. follow the links on the child pages to go deeper into the site

Here is my code so far:

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

from mycrawler.items import LibrarycrawlerItem


class CrawlSpider(scrapy.Spider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url

        yield item

In settings.py:

# Note: current Scrapy versions expect ITEM_PIPELINES as a dict of {path: priority}
ITEM_PIPELINES = {
    'librarycrawler.files.FilesPipeline': 1,
}
FILES_STORE = r'C:\Documents\Spider\crawler\ExtractedText'

In items.py:

import scrapy


class LibrarycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    Files = scrapy.Field()
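
As an aside, if the custom librarycrawler.files.FilesPipeline subclasses Scrapy's built-in FilesPipeline, the standard field names it looks for are file_urls (a list of URLs to download) and files (populated with the download results). A minimal item sketch using those conventional names, assuming the standard pipeline behaviour:

import scrapy


class LibrarycrawlerItem(scrapy.Item):
    title = scrapy.Field()
    # The built-in FilesPipeline reads the URLs to download from 'file_urls' ...
    file_urls = scrapy.Field()
    # ... and stores the download results (path, url, checksum) in 'files'
    files = scrapy.Field()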

Answer

I'm not 100% sure, but I think you can't rename the files Scrapy downloads however you want; Scrapy handles the naming itself.

What you want to do looks like a job for CrawlSpider instead of Spider.

CrawlSpider by itself follows every link it finds on every page recursively, and you can set rules for which pages you want to scrape. Here are the docs.
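
As a rough illustration of that approach (a minimal sketch, not the original poster's code; the spider name and selectors are assumed), a CrawlSpider with a single rule could look like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LibraryCrawlSpider(CrawlSpider):
    name = "librarycrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    # Follow every link found and hand each page to parse_page.
    # Note: CrawlSpider reserves parse(), so the callback must use another name.
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Scrapy's own selectors are enough here; no BeautifulSoup needed.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }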

If you are set on keeping Spider, you can use the meta attribute on requests to pass the items along and store the links in them.

for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request

To get the item back, just look it up on the response:

def scrape_page(self, response):
    item = response.meta['item']

In this specific example, passing item['url'] is redundant, as you can get the current url with response.url.

Also,

it's a bad idea to use Beautiful Soup inside Scrapy, as it just slows you down; Scrapy's own selectors are well developed enough that you don't need anything else to extract data!
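
For example, the BeautifulSoup call in the question could be replaced with Scrapy's built-in selectors. A sketch, keeping the meta-passing pattern and item names from above:

def scrape_page(self, response):
    item = response.meta['item']
    # Scrapy selector equivalent of page_soup.title.get_text()
    item['title'] = response.xpath('//title/text()').get()
    # The built-in FilesPipeline expects 'file_urls' to be a list of URLs
    item['file_urls'] = [response.url]
    yield item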
