Creating a generic Scrapy spider


Problem Description

My question is really how to do the same thing as a previous question, but in Scrapy 0.14:

Using one Scrapy spider for several websites

Basically, I have a GUI that takes parameters like domain, keywords, and tag names, and I want to create a generic spider to crawl those domains for those keywords within those tags. I've read conflicting advice, based on older versions of Scrapy, about either overriding the spider manager class or dynamically creating a spider. Which method is preferred, and how do I implement and invoke the proper solution? Thanks in advance.

Here is the code that I want to make generic. It also uses BeautifulSoup. I pared it down, so hopefully I didn't remove anything crucial to understanding it.

import re

from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):

    name = 'MySpider'
    allowed_domains = ['somedomain.com', 'sub.somedomain.com']
    start_urls = ['http://www.somedomain.com']

    rules = (
        # Follow links under /pages/ without parsing them
        Rule(SgmlLinkExtractor(allow=('/pages/', ))),
        # Parse pages under /2012/03/
        Rule(SgmlLinkExtractor(allow=('/2012/03/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        soup = BeautifulSoup(response.body)

        # Collect every <p itemprop="myProp"> and check it for the keywords
        contentTags = soup.findAll('p', itemprop="myProp")
        for contentTag in contentTags:
            if re.search('Keyword1|Keyword2', contentTag.text):
                print('URL Found: ' + response.url)
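For what it's worth, here is a rough sketch of the direction I'm thinking of: passing the GUI values in as spider arguments (e.g. scrapy crawl generic -a domains=somedomain.com -a start_url=http://www.somedomain.com -a keywords=Keyword1,Keyword2), assuming a Scrapy version that forwards -a arguments to the spider constructor. All names and parameters here are placeholders, not working code.

import re

from BeautifulSoup import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class GenericSpider(CrawlSpider):
    name = 'generic'

    def __init__(self, domains='', start_url='', keywords='', tag='p', **kwargs):
        # Turn the GUI input (comma-separated strings) into the usual spider attributes
        self.allowed_domains = domains.split(',')
        self.start_urls = [start_url]
        self.keyword_re = re.compile('|'.join(k for k in keywords.split(',') if k))
        self.tag = tag
        # Rules must exist before CrawlSpider.__init__ compiles them
        self.rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)
        super(GenericSpider, self).__init__(**kwargs)

    def parse_item(self, response):
        soup = BeautifulSoup(response.body)
        # Check every tag of the requested type for the requested keywords
        for contentTag in soup.findAll(self.tag):
            if self.keyword_re.search(contentTag.text):
                print('URL Found: ' + response.url)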

Recommended Answer

I use the Scrapy Extensions approach to extend the Spider class into a class named MasterSpider that includes a generic parser.

Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a JavaScript engine (such as Selenium) as soon as you start working on pages that use AJAX, plus a lot of additional code to manage differences between sites (scraping based on column titles, handling relative vs. absolute URLs, managing different kinds of data containers, etc.).
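To illustrate that point, here is a rough, hypothetical sketch (not part of my working code) of rendering an AJAX-heavy page with Selenium and wrapping the result so the usual Selector-based parsing still applies; the function name and browser choice are placeholders.

# Hypothetical sketch: render a JavaScript-driven page with Selenium and
# hand the rendered HTML back to Scrapy as an ordinary response.
from selenium import webdriver
from scrapy.http import HtmlResponse

def render_with_selenium(url):
    driver = webdriver.Firefox()          # any WebDriver-backed browser works
    try:
        driver.get(url)                   # lets the page run its JavaScript
        body = driver.page_source.encode('utf-8')
    finally:
        driver.quit()
    # Wrap the rendered HTML so Selector(response) works as usual
    return HtmlResponse(url=url, body=body, encoding='utf-8')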

What is interesting about the Scrapy Extensions approach is that you can still override the generic parser method if something does not fit, though I never had to. The MasterSpider class checks whether certain methods (e.g. parse_start, next_url_parser, ...) have been defined on the site-specific spider class, to allow handling of site specifics: sending a form, constructing the next_url request from elements in the page, etc.

As I'm scraping very different sites, there are always specifics to manage. That's why I prefer to keep one class per scraped site, so that I can write specific methods to handle it (pre-/post-processing besides the pipelines, request generators, ...).

masterspider/sitespider/settings.py

EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500
}

masterspider/masterspider/masterspider.py

# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem

class MasterSpider(Spider):

    def start_requests(self):
        if hasattr(self,'parse_start'): # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [ Request(self.spd['start_url'],
                     callback=fcallback,
                     meta={'itemfields': {}}) ]

    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])               
            # ...
            # Get request_url of detailed page and scrap basic item info
            # ... 
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})

        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Does the next page URL need pre-processing first?
                yield self.next_url_parser(next_url, response)
            else:
                yield Request(next_url,
                              callback=self.parse,
                              meta=response.meta)

    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item

masterspider/sitespider/spiders/somesite_spider.py

# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider

class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url' : "http://www.targetsite.com/startpage", # Start page
        'xlines' : "//td[something...]",
        'xnext_url' : "//a[contains(@href,'something?page=')]/@href", # Next pages
        'x_ondetailpage' : {
            "itemprop123" :      u"id('someid')//text()"
            }
    }

#     def next_url_parser(self, next_url, response): # OPTIONAL next_url regexp pre-processor
#          ...
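
As a purely hypothetical illustration (not from my actual spiders) of what such an optional hook could look like, assuming the extracted hrefs are relative paths that need to be made absolute before being re-queued:

# Hypothetical example only: a site spider that overrides next_url_parser.
# -*- coding: utf8 -*-
from urlparse import urljoin   # Python 2, matching the rest of the code

from scrapy.http import Request
from masterspider.masterspider import MasterSpider

class anothersiteSpider(MasterSpider):
    name = "anothersite"
    allowed_domains = ["www.anothersite.com"]
    spd = {
        'start_url': "http://www.anothersite.com/startpage",
        'xlines': "//td[something...]",
        'xnext_url': "//a[contains(@href,'page=')]/@href",
        'x_ondetailpage': {"itemprop123": u"id('someid')//text()"}
    }

    def next_url_parser(self, next_url, response):
        # The extracted href is relative, so make it absolute before re-queueing
        return Request(urljoin(response.url, next_url),
                       callback=self.parse,
                       meta=response.meta)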
