Python data scraping with Scrapy


Question

I want to scrape data from a website which has text fields, buttons, etc. My requirement is to fill in the text fields and submit the form to get the results, and then scrape the data points from the results page.

I want to know whether Scrapy has this feature, or if anyone can recommend a library in Python to accomplish this task?

(Edited)
I want to scrape the data from the following website:
http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType

My requirement is to select the values from the ComboBoxes, hit the search button, and scrape the data points from the result page.

P.S. I'm using the Selenium Firefox driver to scrape data from another website, but that solution is not good because the Selenium Firefox driver depends on the Firefox executable, i.e. Firefox must be installed before running the scraper.

The Selenium Firefox driver also consumes around 100 MB of memory per instance, and my requirement is to run many instances at a time to make the scraping process quick, so there is a memory limitation as well.

Firefox sometimes crashes during execution of the scraper, I don't know why. Also, I need windowless (headless) scraping, which is not possible with the Selenium Firefox driver.

My ultimate goal is to run the scrapers on Heroku, where I have a Linux environment, so the Selenium Firefox driver won't work there. Thanks

Answer

Basically, you have plenty of tools to choose from.

These tools have different purposes, but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating the page - clicking buttons, filling forms - it becomes more complicated:

  • sometimes it is easy to simulate filling and submitting a form by making the underlying form request directly in scrapy (a small sketch follows this list)
  • sometimes you have to use other tools to aid the scraping - like mechanize or selenium
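
For the first case, here is a minimal sketch of what simulating a form submission directly in scrapy could look like, using FormRequest.from_response. The URL, form field name and spider name below are hypothetical placeholders, not taken from the original answer:

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class FormExampleSpider(BaseSpider):
    name = "form_example"
    start_urls = ["http://example.com/search"]  # hypothetical page containing a search form

    def parse(self, response):
        # Let scrapy pre-fill the form from the page and override the fields we care about;
        # "query" is a placeholder - inspect the real form to find the actual field names.
        yield FormRequest.from_response(response,
                                        formdata={"query": "some value"},
                                        callback=self.parse_results)

    def parse_results(self, response):
        # The page returned by the form submission arrives here; parse it as usual.
        pass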

If you make your question more specific, it will help to understand what kind of tool you should use or choose from.

Take a look at an interesting example of a scrapy & selenium mix. Here, selenium's task is to click the button and provide data for the scrapy items:

import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from selenium import webdriver


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        # A single Firefox instance is shared by the whole spider.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the page in the browser and click the "go" button.
        self.driver.get(response.url)
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        if el:
            el.click()

        # Crude wait for the results to render.
        time.sleep(10)

        # Hand the rendered results over to scrapy items.
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()
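
As a side note, the fixed time.sleep(10) above could be replaced with an explicit wait, so the spider only blocks until the results actually appear. A minimal sketch, assuming the result blocks keep the same plan-info class (the helper name is made up for illustration):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_plans(driver, timeout=10):
    # Block until at least one ".plan-info" element is present, or raise TimeoutException.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "plan-info")))

Inside parse, plans = wait_for_plans(self.driver) would then replace both the sleep and the find_elements_by_class_name call.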

UPDATE:

Here's an example of how to use scrapy in your case:

from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector

from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Every <option> of the document-type dropdown on the search page.
        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')

        # Anti-forgery token that has to be posted back with the form data.
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.select('.//@value').extract()[0]
                doc_type_name = document_class.select('.//text()').extract()[0]
                # Reproduce the hidden fields that the search form posts; one request per document type.
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': '', }
                # Post the search form directly and remember which document type this request was for.
                yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                                  method="POST",
                                  formdata=formdata,
                                  callback=self.parse_page,
                                  meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        # Each row of the result table becomes one item.
        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()

            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it as spider.py, run it via scrapy runspider spider.py -o output.json, and in output.json you will see:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...

Hope that helps.

