Pagination level 2 - scrapy python


Question


I had to write a scraper, and I don't understand why it doesn't work.

The website has pagination like this:

<div class="pagination toolbarbloc">
        <ul>
                <li class="active"><span>1</span></li>
                <li><a href="...">2</a></li>
                <li><a href="...">3</a></li>
                <li><a href="...">4</a></li>
                <li><a href="...">5</a></li>
                <li><a class="end" href="...">&gt;&gt;</a></li>
        </ul>
</div>

The "active" class moves as you go to the next page, so on page 5 it is the "li" tag just before the last one that has the class "active". I select the item after the "li" tag with class "active" like this:

next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'

It works perfectly for the first 5 pages, but it fails to go to page 6, where it should pick up the "a" tag with class "end".

I tried this:

    try:
        next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li[@class="active"]/following-sibling::li/a/@href'
        next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()
    except (ValueError,IndexError):
        next_page_url_xpath = '//div[@class="pagination toolbarbloc"]/ul/li/a[@class="end"]/@href'
        next_page_url = begin + response.xpath(next_page_url_xpath)[0].extract()

Does anyone have an idea? :) Thanks for your help!

Solution

from lxml import etree

# Sample markup mirroring the pagination block from the question.
test_xml = """<div class="pagination toolbarbloc">
        <ul>
                <li class="active"><span>1</span></li>
                <li><a href="1href">2</a></li>
                <li><a href="2href">3</a></li>
                <li><a href="3href">4</a></li>
                <li><a href="4href">5</a></li>
                <li><a class="end" href="5href">&gt;&gt;</a></li>
        </ul>
</div>"""

# Parse the fragment with lxml's lenient HTML parser.
tree = etree.HTML(test_xml)
# Collect the href of every page link inside the pagination list.
rep = tree.xpath('//div[@class="pagination toolbarbloc"]/ul/li/a/@href')

print(rep)
# ['1href', '2href', '3href', '4href', '5href']

I'm not sure I fully understood your question. If you really want a Python function along these lines, maybe this can help you.
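If the goal is only the link to the page right after the active one, a narrower XPath may be enough. The sketch below is an assumption about what the question needs, not a tested fix for the real site: it restricts the `following-sibling` axis to the nearest `li` with `[1]`, and returns `None` when there is no following link. The helper names `next_page_href` and `make_pagination`, and the `pageN` / `last` href values, are hypothetical and exist only for illustration.

```python
from lxml import etree

# XPath for the <li> immediately after the active one; the [1] predicate
# limits the following-sibling axis to the nearest sibling only.
NEXT_XPATH = ('//div[@class="pagination toolbarbloc"]/ul'
              '/li[@class="active"]/following-sibling::li[1]/a/@href')


def next_page_href(html):
    """Return the href following the active page, or None if there is none."""
    tree = etree.HTML(html)
    hrefs = tree.xpath(NEXT_XPATH)
    return hrefs[0] if hrefs else None


def make_pagination(active_page, page_count=5):
    """Build hypothetical sample markup with one page marked as active."""
    items = []
    for page in range(1, page_count + 1):
        if page == active_page:
            items.append('<li class="active"><span>%d</span></li>' % page)
        else:
            items.append('<li><a href="page%d">%d</a></li>' % (page, page))
    # The trailing "end" link, as in the markup from the question.
    items.append('<li><a class="end" href="last">&gt;&gt;</a></li>')
    return ('<div class="pagination toolbarbloc"><ul>'
            + ''.join(items) + '</ul></div>')


if __name__ == '__main__':
    print(next_page_href(make_pagination(1)))
    print(next_page_href(make_pagination(5)))
```

In a Scrapy callback the same expression could be passed to `response.xpath(NEXT_XPATH).extract_first()`, which returns `None` instead of raising `IndexError`, and `response.urljoin(...)` could replace the manual `begin + ...` concatenation. Whether the "end" link on the real site actually leads to page 6 is something only the live markup can confirm.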
