Pagination level2 - scrapy python
Question
I had to write a scraper, and I don't understand why it doesn't work.
The website has pagination like this:
<div class="pagination toolbarbloc">
  <ul>
    <li class="active"><span>1</span></li>
    <li><a href="...">2</a></li>
    <li><a href="...">3</a></li>
    <li><a href="...">4</a></li>
    <li><a href="...">5</a></li>
    <li><a class="end" href="...">>></a></li>
  </ul>
</div>
The "active" class moves when you go to the next page, so on page 5 it is the "li" tag just before the last one that has the class "active".
I grab the item after the "li" tag with class "active" like this:
next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'
It works perfectly for the first 5 pages, but it fails to go to page 6, where it should catch the "a" tag with class "end".
I tried this:
try:
    next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'
    next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
except (ValueError, IndexError):
    next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li/a[@class="end"]/@href'
    next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
Does anyone have an idea? :)
Thanks for your help!
Solution
from lxml import etree
test_xml = """<div class="pagination toolbarbloc">
<ul>
<li class="active"><span>1</span></li>
<li><a href="1href">2</a></li>
<li><a href="2href">3</a></li>
<li><a href="3href">4</a></li>
<li><a href="4href">5</a></li>
<li><a class="end" href="5href">>></a></li>
</ul>
</div>"""
tree = etree.HTML(test_xml)
rep = tree.xpath('//div[@class="pagination toolbarbloc"]/ul/li/a/@href')
print(rep)
# ['1href', '2href', '3href', '4href', '5href']
I'm not sure I fully understood your question, but if you want a Python function like this, maybe it can help.
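If the goal is only the link to the next page, one possible approach is to take just the first "li" after the active one instead of all following siblings. The sketch below shows that logic with the standard library's xml.etree.ElementTree so it runs without Scrapy or lxml. The names build_pagination and next_page_href and the href values ("page2", "end-href") are placeholders invented for illustration, and the markup for page 5 is an assumption based on the description in the question. In a spider, the same idea could probably be written as following-sibling::li[1]/a/@href combined with extract_first(), which returns None instead of raising IndexError when nothing matches.

```python
import xml.etree.ElementTree as ET


def build_pagination(active, last=5):
    """Build sample pagination markup (hypothetical, modelled on the question)."""
    parts = ['<div class="pagination toolbarbloc"><ul>']
    for page in range(1, last + 1):
        if page == active:
            # The current page is a plain <span>, not a link.
            parts.append('<li class="active"><span>%d</span></li>' % page)
        else:
            parts.append('<li><a href="page%d">%d</a></li>' % (page, page))
    # Placeholder href: the real target of the "end" link is not known.
    parts.append('<li><a class="end" href="end-href">&gt;&gt;</a></li>')
    parts.append('</ul></div>')
    return "".join(parts)


def next_page_href(markup):
    """Return the href of the link in the <li> right after the active one."""
    root = ET.fromstring(markup)
    items = list(root.iter("li"))
    # Pair every <li> with its immediate successor.
    for current, following in zip(items, items[1:]):
        if current.get("class") == "active":
            link = following.find("a")
            return link.get("href") if link is not None else None
    # No active item, or the active item is the last one.
    return None


print(next_page_href(build_pagination(1)))  # page2
print(next_page_href(build_pagination(5)))  # end-href
```

On the last numbered page the function falls through to the "end" link on its own, so no try/except fallback is needed in this sketch.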