How to access a specific start_url in a Scrapy CrawlSpider?


Question

I'm using Scrapy, in particular Scrapy's CrawlSpider class, to scrape web links which contain certain keywords. I have a pretty long start_urls list which gets its entries from a SQLite database which is connected to a Django project. I want to save the scraped web links in this database.

I have two Django models, one for the start urls such as http://example.com and one for the scraped web links such as http://example.com/website1, http://example.com/website2 etc. All scraped web links are subsites of one of the start urls in the start_urls list.

The web links model has a many-to-one relation to the start url model, i.e. the web links model has a ForeignKey to the start urls model. In order to save my scraped web links properly to the database, I need to tell the CrawlSpider's parse_item() method which start url the scraped web link belongs to. How can I do that? Scrapy's DjangoItem class does not help in this respect, as I still have to define the used start url explicitly.
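
For reference, the two models described in the question could look roughly like the following sketch; the names StartUrl and WebLink are only illustrative and not taken from the question:

from django.db import models

class StartUrl(models.Model):
    url = models.URLField()

class WebLink(models.Model):
    url = models.URLField()
    # many-to-one: every scraped link belongs to exactly one start url
    start_url = models.ForeignKey(StartUrl, on_delete=models.CASCADE)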

In other words, how can I pass the currently used start url to the parse_item() method, so that I can save it together with the appropriate scraped web links to the database? Any ideas? Thanks in advance!

Answer

By default you cannot access the original start url.

But you can override the make_requests_from_url method and put the start url into the request's meta. Then, in a parse callback, you can extract it from there (if you yield subsequent requests in that parse method, don't forget to forward the start url in their meta).

I haven't worked with CrawlSpider, and maybe what Maxim suggests will work for you, but keep in mind that response.url contains the URL after possible redirections.
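
To make that concrete: meta values set on the original request are normally carried along by Scrapy's redirect handling, while response.url reflects the final URL. A minimal, untested sketch of a callback using both (it mirrors the fuller example below):

def parse_item(self, response):
    # response.url is the URL after any redirects; the start url stored in the
    # request's meta by make_requests_from_url is still the original one.
    start_url = response.meta['start_url']
    self.log('scraped %s (start url: %s)' % (response.url, start_url))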

Here is an example of how I would do it, but it's just an example (taken from the Scrapy tutorial) and was not tested:

# Imports for the (older) Scrapy APIs used in this example.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item.
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    # When writing crawl spider rules you should normally avoid using parse as a
    # callback, since CrawlSpider uses the parse method itself to implement its
    # logic, and overriding it carelessly breaks the spider. Here parse only
    # delegates to CrawlSpider.parse() and copies 'start_url' into the meta of
    # every request it generates.
    def parse(self, response):
        for request_or_item in CrawlSpider.parse(self, response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        """A method that receives a URL and returns a Request object (or a list of
        Request objects) to scrape. This method is used to construct the initial
        requests in the start_requests() method, and is typically used to convert
        urls to requests. Here it also stores the start url in the request's meta.
        """
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        # In a real project this would be an Item subclass declaring the
        # 'id', 'name', 'description' and 'start_url' fields.
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        item['start_url'] = response.meta['start_url']
        return item
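
To actually persist each scraped link under its start url, a Django-backed item pipeline could look roughly like the sketch below. StartUrl and WebLink are the illustrative model names from earlier (WebLink having a ForeignKey to StartUrl), myapp is a placeholder import path, and the item is assumed to also carry the scraped link itself in a 'url' field; all of these would need to be adapted to the real project:

from myapp.models import StartUrl, WebLink  # placeholder import path and model names


class SaveWebLinkPipeline(object):

    def process_item(self, item, spider):
        # Look up the parent start url row and attach the scraped link to it.
        start = StartUrl.objects.get(url=item['start_url'])
        WebLink.objects.create(start_url=start, url=item['url'])
        return item

The pipeline would still have to be enabled in the project's ITEM_PIPELINES setting, and the Django settings must be importable from the Scrapy process for the ORM calls to work.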

Ask if you have any questions. BTW, using PyDev's 'Go to definition' feature you can see the Scrapy sources and understand what parameters Request, make_requests_from_url and other classes and methods expect. Getting into the code helps and saves you time, though it might seem difficult at the beginning.
