Scrapy Tutorial Example

Views: 29 Published: 2021/7/16 21:55:59 Tags: python web-scraping scrapy web-crawler

Problem description

I'm hoping someone can point me in the right direction with using Scrapy in Python.

I've been trying to follow the example for several days and still can't get the expected output. I used the Scrapy tutorial, http://doc.scrapy.org/en/latest/intro/tutorial.html#defining-our-item, and even downloaded the exact project from the GitHub repo, but the output I get does not match what the tutorial describes.

from scrapy.spiders import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        # Each <li> under <ul class="directory-url"> is treated as one site entry.
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re(r'-\s[^\n]*\r')
            items.append(item)

        return items

After downloading the project from GitHub, I ran "scrapy crawl dmoz" in the top-level directory and got the following output:

2016-08-31 00:08:19 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-08-31 00:08:19 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders']}
2016-08-31 00:08:19 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-31 00:08:19 [scrapy] INFO: Enabled item pipelines:
['dirbot.pipelines.FilterWordsPipeline']
2016-08-31 00:08:19 [scrapy] INFO: Spider opened
2016-08-31 00:08:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-31 00:08:19 [scrapy] DEBUG: Telnet console listening on 128.1.2.1:2700
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-31 00:08:20 [scrapy] INFO: Closing spider (finished)
2016-08-31 00:08:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16179,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 31, 7, 8, 20, 314625),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 8, 31, 7, 8, 19, 882944)}
2016-08-31 00:08:20 [scrapy] INFO: Spider closed (finished)

According to the tutorial, I expected this:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
 {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
  'link': [u'http://gnosis.cx/TPiP/'],
  'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
 {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
  'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
  'title': [u'XML Processing with Python']}
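
One thing that may be worth checking is whether the XPath in parse() matches anything at all, since an empty match means the loop body never runs and no items are returned. The following is a minimal sketch of that check, not the spider itself: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, and SAMPLE_HTML and extract_sites are hypothetical names with made-up markup (not copied from dmoz.org) shaped the way the spider's XPath expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped the way the spider's XPath expects.
# It is an assumption for illustration, not real dmoz.org content.
SAMPLE_HTML = """
<html><body>
  <ul class="directory-url">
    <li><a href="http://example.com/book-one">Book One</a> - A first description.</li>
    <li><a href="http://example.com/book-two">Book Two</a> - A second description.</li>
  </ul>
</body></html>
"""


def extract_sites(markup):
    """Return one dict per <li> under <ul class="directory-url">."""
    root = ET.fromstring(markup.strip())
    items = []
    for li in root.findall(".//ul[@class='directory-url']/li"):
        link = li.find("a")
        if link is None:
            continue
        items.append({
            "name": link.text,
            "url": link.get("href"),
            # ElementTree stores the text that follows </a> as the link's tail.
            "description": (link.tail or "").strip(),
        })
    return items


matching = extract_sites(SAMPLE_HTML)
# Same markup with a different class name: the XPath finds nothing.
non_matching = extract_sites(SAMPLE_HTML.replace("directory-url", "other-class"))

print(len(matching), len(non_matching))
```

With the sample markup the first call yields two entries and the second yields none. If the live pages no longer contain a ul element with class "directory-url", the spider would behave like the second call, which would be consistent with a log that shows both pages crawled with status 200 but no "Scraped from" lines.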
