Scrapy - Select specific link based on text


Problem description

This should be easy but I'm stuck.

<div class="paginationControl">
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text 2</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=3&amp;powerunit=2">Link Text 3</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=4&amp;powerunit=2">Link Text 4</a> | 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=5&amp;powerunit=2">Link Text 5</a> |   

<!-- Next page link --> 
  <a href="/en/overview/0-All_manufactures/0-All_models.html?page=2&amp;powerunit=2">Link Text Next ></a>
</div>

I'm trying to use Scrapy (BaseSpider) to select a link based on its link text using:

nextPage = HtmlXPathSelector(response).select("//div[@class='paginationControl']/a/@href").re("(.+)*?Next")

For example, I want to select the next page link based on the fact that its text is "Link Text Next". Any ideas?

Solution

Use a[contains(text(),'Link Text Next')]:

nextPage = HtmlXPathSelector(response).select(
    "//div[@class='paginationControl']/a[contains(text(),'Link Text Next')]/@href")

Reference: Documentation on the XPath contains function


PS. Your text Link Text Next has a space at the end. An exact comparison would have to include that space in the code:

text()="Link Text Next "

To avoid that, I think using contains is a bit more general while still being specific enough.
