XPATH for Scrapy


Question

I am using Scrapy to scrape book listings from a website.

I have the crawler working and it crawls fine, but cleaning the HTML with an XPath select is not working out right. Since it is a book website, there are almost 131 books on each page, and their XPaths come out like this

For example, to get the book title -

1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book --->  /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span 
3rd book --->  /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span 

The li[] index increases with each book. I am not sure how to put this into a loop so that it catches all the titles. I will have to do the same for images and author names, but I think it will be similar; I just need to get this initial one done.

Thanks in advance for your help.

Answer

There are several ways to do this:

  1. The best way to select multiple nodes is on the basis of an id or class, e.g.:

sel.xpath("//div[@id='id']")
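As a concrete sketch of this idea applied to the question's book list (the `book-list` id and the markup below are assumptions, not the real page), here it is with the standard library's ElementTree; Scrapy's `sel.xpath()` accepts the same expression:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the scraped page; the id "book-list" is assumed
page = """
<html><body>
  <div id="book-list">
    <ul>
      <li><a><span>Book One</span></a></li>
      <li><a><span>Book Two</span></a></li>
      <li><a><span>Book Three</span></a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(page)
# One id-anchored query matches every <li>; no positional index needed
spans = root.findall(".//div[@id='book-list']/ul/li/a/span")
titles = [s.text for s in spans]
# titles == ['Book One', 'Book Two', 'Book Three']
```

Anchoring on an id or class keeps the query stable even if the page's surrounding layout divs change, which is why it beats the long absolute paths in the question.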

  • You can select with a positional loop like this:

    for i in range(1, upto_num_of_divs + 1):  # XPath positions are 1-based
        nodes = sel.xpath("//div[%d]" % i)
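Here is the same 1-based positional loop illustrated with the stdlib ElementTree (the markup and count are made up for the sketch; Scrapy's `sel.xpath()` accepts the same position-indexed expressions):

```python
import xml.etree.ElementTree as ET

# Stand-in markup mirroring the question's li-per-book structure
root = ET.fromstring(
    "<div><ul>"
    "<li><a><span>Book One</span></a></li>"
    "<li><a><span>Book Two</span></a></li>"
    "<li><a><span>Book Three</span></a></li>"
    "</ul></div>"
)

num_books = 3  # the page's known item count (131 in the question)
titles = []
for i in range(1, num_books + 1):  # XPath positions start at 1, not 0
    span = root.find(".//ul/li[%d]/a/span" % i)
    titles.append(span.text)
# titles == ['Book One', 'Book Two', 'Book Three']
```

Note the off-by-one trap: `range(0, n)` would ask for `li[0]`, which matches nothing in XPath.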
    

  • Or you can select a range of nodes in a single query with position():

    nodes = sel.xpath("//div[position() >= 1 and position() < %d]"
                      % upto_num_of_divs)
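A runnable sketch of the position() range query, using lxml (the library Scrapy's selectors are built on); the markup and bound are illustrative:

```python
from lxml import etree

# Stand-in markup mirroring the question's li-per-book structure
root = etree.fromstring(
    "<div><ul>"
    "<li><a><span>Book One</span></a></li>"
    "<li><a><span>Book Two</span></a></li>"
    "<li><a><span>Book Three</span></a></li>"
    "</ul></div>"
)

upto_num_of_divs = 3
# position() is 1-based; this keeps items 1 and 2 and drops the third
titles = root.xpath(
    "//ul/li[position() >= 1 and position() < %d]/a/span/text()"
    % upto_num_of_divs
)
# titles == ['Book One', 'Book Two']
```

Because the predicate filters the whole node set in one pass, no Python-side loop is needed, unlike the positional-index approach above.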
    
