Scrapy - LinkExtractor in control flow and why it doesn't work
Problem description
I'm trying to understand why my LinkExtractor doesn't work and when it actually runs in the crawl loop.
This is the page I'm crawling.
- There are 25 listings on each page, and their links are parsed in parse_page.
- Each crawled link is then parsed in parse_items.
This script crawls the first page and the items on it without any problem. The problem is that it doesn't follow https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2 (sayfa means page in Turkish) or any of the following pages.
I think my Rule and LinkExtractor are correct, because when I tried allowing all links it didn't work either.
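As a rough sanity check outside Scrapy: the allow pattern is matched against the absolute URL, so a plain re.search against the pagination address from above suggests the pattern itself is fine (a minimal sketch, assuming the URLs shown in this question):

import re

# Rough standalone check that the rule's allow pattern matches the
# pagination URLs the spider is expected to follow.
pattern = r'.*&sayfa=\d+'
urls = [
    'https://www.yenibiris.com/is-ilanlari?q=yazilim',          # first page, no sayfa parameter
    'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2',  # second page
]
for url in urls:
    print(url, bool(re.search(pattern, url)))
# The second URL matches, so the regular expression is not the problem.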
My questions are:
- When are the LinkExtractors supposed to run in this script, and why are they not running?
- How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?
- How can I implement parse_page with a LinkExtractor?
These are the relevant parts of my spider.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    # Follow pagination links such as ...&sayfa=2 and parse each page.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)),
             callback='parse_page',
             follow=True),
    )

    def __init__(self):
        super().__init__()
        self.allowed_domains = ['yenibiris.com']
        self.start_urls = [
            'https://www.yenibiris.com/is-ilanlari?q=yazilim',
        ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                callback=self.parse_page
            )

    def parse_page(self, response):
        # Collect the 25 listing links on the page and request each one.
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here (item construction omitted)
        yield item
Recommended answer
I hate to answer my own question, but I think I figured it out. When I define the start_requests function, I am probably overriding the rules behavior, so it didn't work. When I remove the __init__ and start_requests functions, the spider works as intended.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    # With no start_requests() override, CrawlSpider's built-in parse() handles
    # the start URL responses, so the rule below is applied to them.
    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here (item construction omitted)
        yield item
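For completeness, here is a minimal sketch of how start_requests could be kept (for example, to build the query dynamically) without losing the rules: as long as the request is yielded without a custom callback, the response goes through CrawlSpider's default parse() and the LinkExtractor rule still fires. The spider name and the placeholder item below are assumptions for illustration, not part of the original code.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YenibirisStartRequestsSpider(CrawlSpider):
    # Hypothetical variant: start_requests is kept, but no callback is passed,
    # so the response is handled by CrawlSpider's default parse() and the
    # pagination rule below is still applied.
    name = 'yenibiris_startrequests_spider'
    allowed_domains = ['yenibiris.com']

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def start_requests(self):
        # No callback argument here - that is what keeps the rules active.
        yield scrapy.Request('https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1')

    def parse_page(self, response):
        # Same selector as in the question.
        links = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_items)

    def parse_items(self, response):
        yield {'url': response.url}  # placeholder item, real fields omitted

Note that rule callbacks only run for links extracted by the rules, not for the start URL response itself; if the listings on the very first page also need parsing, CrawlSpider's parse_start_url() hook can be overridden to delegate to parse_page.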