Webpage access while using Scrapy

Views: 76 · Published: 2021/6/26 20:21:58 · Tags: python, python-2.7, web-scraping, scrapy, scrapy-spider


Problem description

I am new to Python and Scrapy. I followed the tutorial and tried to crawl a few webpages. I used the code from the tutorial and replaced the URLs with http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0 and http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1 respectively.

When the HTML file is generated, the full data is not displayed. Only the data up to this URL is shown: http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=0&sd=0&states=ALL&near=&ps=20&p=0

Also, when the command runs, the second URL is dropped as a duplicate, and only one HTML file is created.
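
A possible explanation for the duplicate message, offered as a sketch rather than a confirmed diagnosis: everything after `#` in a URL is a fragment, which the client does not send to the server, and as far as I understand Scrapy's default duplicate filter ignores fragments when comparing requests. The two URLs differ only inside the fragment, so once it is stripped they are identical. The snippet below uses only the standard library; the names `BASE`, `PARAMS`, `urls` and `stripped` are illustrative and not part of the original spider.

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2.7

BASE = 'http://www.city-data.com/advanced/search.php'
PARAMS = ('fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914'
          '&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500'
          '&e6819=MAX&i6819=1&ps=20&p=%d')

# Rebuild the two URLs from the question (page 0 and page 1).
urls = [BASE + '#body?' + PARAMS % page for page in (0, 1)]

# urldefrag() splits off the '#...' part; element 0 is the URL without it.
stripped = [urldefrag(url)[0] for url in urls]

for url in stripped:
    print(url)

# Both entries are the bare search.php address, so the requests look identical.
print(stripped[0] == stripped[1])
```

If this reading is right, it would also explain why the saved page shows only the default search results: the server never sees the parameters placed after `#body?`.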

I want to know whether the webpage denies access to that specific data, or whether I should change my code to get the exact data.

When I then run the shell command, I get an error. The output of the crawl command and the shell command was:

    C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline crawl citydata
2016-10-19 12:00:27 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:00:27 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tu
torial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True,
 'BOT_NAME': 'tutorial'}
2016-10-19 12:00:27 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:00:27 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:00:27 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:00:27 [scrapy] INFO: Spider opened
2016-10-19 12:00:27 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2016-10-19 12:00:27 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:00:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.
city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=
&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX
&i6819=1&ps=20&p=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to
show all duplicates)
2016-10-19 12:00:28 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/
robots.txt> (referer: None)
2016-10-19 12:00:29 [scrapy] DEBUG: Crawled (200) <GET http://www.city-data.com/
advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=691
4&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20
&p=0> (referer: None)
2016-10-19 12:00:29 [citydata] DEBUG: Saved file citydata-advanced.html
2016-10-19 12:00:29 [scrapy] INFO: Closing spider (finished)
2016-10-19 12:00:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 459,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 44649,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 10, 19, 6, 30, 29, 751000),
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 10, 19, 6, 30, 27, 910000)}
2016-10-19 12:00:29 [scrapy] INFO: Spider closed (finished)

C:\Users\MinorMiracles\Desktop\tutorial>python -m scrapy.cmdline shell 'http://w
ww.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&ne
ar=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=
MAX&i6819=1&ps=20&p=0'
2016-10-19 12:21:51 [scrapy] INFO: Scrapy 1.2.0 started (bot: tutorial)
2016-10-19 12:21:51 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tu
torial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters
.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'
, 'LOGSTATS_INTERVAL': 0}
2016-10-19 12:21:51 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-19 12:21:51 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-19 12:21:51 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-19 12:21:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-19 12:21:51 [scrapy] INFO: Spider opened
2016-10-19 12:21:53 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (fai
led 1 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getad
drinfo failed.
2016-10-19 12:21:56 [scrapy] DEBUG: Retrying <GET http://'http:/robots.txt> (fai
led 2 times): DNS lookup failed: address "'http:" not found: [Errno 11004] getad
drinfo failed.
2016-10-19 12:21:58 [scrapy] DEBUG: Gave up retrying <GET http://'http:/robots.t
xt> (failed 3 times): DNS lookup failed: address "'http:" not found: [Errno 1100
4] getaddrinfo failed.
2016-10-19 12:21:58 [scrapy] ERROR: Error downloading <GET http://'http:/robots.
txt>: DNS lookup failed: address "'http:" not found: [Errno 11004] getaddrinfo f
ailed.
DNSLookupError: DNS lookup failed: address "'http:" not found: [Errno 11004] get
addrinfo failed.
2016-10-19 12:22:00 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.c
om/advanced/search.php#body?fips=0> (failed 1 times): DNS lookup failed: address
 "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:03 [scrapy] DEBUG: Retrying <GET http://'http://www.city-data.c
om/advanced/search.php#body?fips=0> (failed 2 times): DNS lookup failed: address
 "'http:" not found: [Errno 11004] getaddrinfo failed.
2016-10-19 12:22:05 [scrapy] DEBUG: Gave up retrying <GET http://'http://www.cit
y-data.com/advanced/search.php#body?fips=0> (failed 3 times): DNS lookup failed:
 address "'http:" not found: [Errno 11004] getaddrinfo failed.
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 161, in <module>
    execute()
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 88, in _run_print
_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy\cmdline.py", line 149, in _run_comm
and
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy\commands\shell.py", line 71, in run

    shell.start(url=url)
  File "C:\Python27\lib\site-packages\scrapy\shell.py", line 47, in start
    self.fetch(url, spider)
  File "C:\Python27\lib\site-packages\scrapy\shell.py", line 112, in fetch
    reactor, self._schedule, request, spider)
  File "C:\Python27\lib\site-packages\twisted\internet\threads.py", line 122, in
 blockingCallFromThread
    result.raiseException()
  File "<string>", line 2, in raiseException
twisted.internet.error.DNSLookupError: DNS lookup failed: address "'http:" not f
ound: [Errno 11004] getaddrinfo failed.
'csize' is not recognized as an internal or external command,
operable program or batch file.
'sc' is not recognized as an internal or external command,
operable program or batch file.
'sd' is not recognized as an internal or external command,
operable program or batch file.
'states' is not recognized as an internal or external command,
operable program or batch file.
'near' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit1' is not recognized as an internal or external command,
operable program or batch file.
'b6914' is not recognized as an internal or external command,
operable program or batch file.
'e6914' is not recognized as an internal or external command,
operable program or batch file.
'i6914' is not recognized as an internal or external command,
operable program or batch file.
'nam_crit2' is not recognized as an internal or external command,
operable program or batch file.
'b6819' is not recognized as an internal or external command,
operable program or batch file.
'e6819' is not recognized as an internal or external command,
operable program or batch file.
'i6819' is not recognized as an internal or external command,
operable program or batch file.
'ps' is not recognized as an internal or external command,
operable program or batch file.
'p' is not recognized as an internal or external command,
operable program or batch file.
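
The tail of the shell output above can be reproduced with a simplified model of how the Windows command prompt appears to have read the line. This is a sketch under two assumptions: that `cmd.exe` does not treat single quotes as quoting characters, and that it treats each unquoted `&` as a command separator. The variable names are illustrative only.

```python
# The command line as typed in the question, single quotes included.
command = (
    "python -m scrapy.cmdline shell "
    "'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2"
    "&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1"
    "&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0'"
)

# Simplified model: every '&' starts a new command.
pieces = command.split('&')

# Only the first piece reaches Scrapy, with the stray leading quote attached.
first_command = pieces[0]

# Each remaining piece looks like 'name=value'; the name is tried as a command.
stray_commands = [piece.split('=')[0] for piece in pieces[1:]]

print(first_command)
print(stray_commands)
```

Under this model the first piece ends at `fips=0`, which matches the truncated URL in the DNS errors, and the remaining names match the "is not recognized" messages one for one. Wrapping the URL in double quotes instead of single quotes is the usual way to keep `&` literal in the Windows command prompt.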

My code is:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = [
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=0',
            'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p=1',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'citydata-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
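
One further detail in `parse`: `response.url.split("/")[-2]` evaluates to `advanced` for both URLs, so even if both pages were fetched they would be written to the same `citydata-advanced.html`. A minimal sketch of one way to derive a distinct name per page follows. `page_filename` is a hypothetical helper, not part of Scrapy, and it assumes the page number is carried in the `p` parameter as in the URLs above.

```python
try:
    from urllib.parse import urlsplit, parse_qs  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs  # Python 2.7


def page_filename(url):
    """Hypothetical helper: name the output file after the 'p' parameter."""
    parts = urlsplit(url)
    # Parameters may be in the real query string or, as in the question,
    # after '#body?' inside the fragment.
    params = parts.query or parts.fragment.partition('?')[2]
    page = parse_qs(params).get('p', ['unknown'])[0]
    return 'citydata-p%s.html' % page


url_page_0 = 'http://www.city-data.com/advanced/search.php#body?fips=0&ps=20&p=0'
url_page_1 = 'http://www.city-data.com/advanced/search.php#body?fips=0&ps=20&p=1'

print(page_filename(url_page_0))
print(page_filename(url_page_1))

# The original expression yields the same path segment for both URLs.
print(url_page_0.split("/")[-2], url_page_1.split("/")[-2])
```

The two sample URLs are shortened versions of the ones in the question; only the `p` value matters for the helper.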

Could someone please guide me?
