Scrapy - how to manage pagination without 'Next' button?
Question
I'm scraping the content of articles from a site like this where there is no 'Next' button to follow. ItemLoader is passed from parse_issue in the response.meta object, along with some additional data such as section_name. Here is the function:
def parse_article(self, response):
    self.logger.info('Parse function called parse_article on {}'.format(response.url))
    acrobat = response.xpath('//div[@class="txt__lead"]/p[contains(text(), "Plik do pobrania w wersji (pdf) - wymagany Acrobat Reader")]')
    limiter = response.xpath('//p[@class="limiter"]')
    if not acrobat and not limiter:
        loader = ItemLoader(item=response.meta['periodical_item'].copy(), response=response)
        loader.add_value('section_name', response.meta['section_name'])
        loader.add_value('article_url', response.url)
        loader.add_xpath('article_authors', './/p[@class="l doc-author"]/b')
        loader.add_xpath('article_title', '//div[@class="cf txt "]//h1')
        loader.add_xpath('article_intro', '//div[@class="txt__lead"]//p')
        article_content = response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
        # check for pagination
        next_page_url = response.xpath('//span[@class="pgr_nrs"]/span[contains(text(), 1)]/following-sibling::a[1]/@href').get()
        if next_page_url:
            # I'm not sure what should be here... Something like this: (???)
            yield response.follow(next_page_url, callback=self.parse_article, meta={
                'periodical_item': loader.load_item(),
                'article_content': article_content
            })
        else:
            loader.add_xpath('article_content', article_content)
            yield loader.load_item()
The problem is in the parse_article function: I don't know how to combine the paragraph content from all pages into one item. Does anybody know how to solve this?
Answer
Your parse_article looks good. If the issue is just adding the article_content to the loader, you just need to fetch it from response.meta:
I would update this line (note the default must be a list, not a string, since getall() returns a list and concatenating a string with a list would raise a TypeError):

article_content = response.meta.get('article_content', []) + response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
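The accumulation pattern behind this line can be illustrated outside Scrapy. In the sketch below, `collect_paragraphs` is a hypothetical helper (not part of Scrapy or the original spider): each element of `pages` stands in for the list that `response.xpath('...//p').getall()` returns on one paginated page, and the `meta` dict plays the role of `response.meta` carried between requests.

```python
def collect_paragraphs(pages):
    """Simulate merging paragraph lists across paginated responses.

    `pages` is a list of per-page paragraph lists; in a real spider
    each would come from response.xpath(...).getall(), and `meta`
    would be passed along via response.follow(..., meta=...).
    """
    meta = {}
    for page in pages:
        # Same pattern as the answer: default to [] (not ''),
        # because getall() returns a list of strings.
        meta['article_content'] = meta.get('article_content', []) + page
    return meta.get('article_content', [])


pages = [['<p>one</p>', '<p>two</p>'], ['<p>three</p>']]
print(collect_paragraphs(pages))
# ['<p>one</p>', '<p>two</p>', '<p>three</p>']
```

On the last page (no next_page_url), the accumulated list would then be attached with loader.add_value('article_content', article_content) rather than add_xpath, since by that point it is a plain list of strings, not an XPath expression.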