scrapy and xpath function 'matches' syntax


Problem description

I'm running Scrapy 0.20.2.

$ scrapy shell "http://newyork.craigslist.org/ata/"

I would like to build a list of all links to advertisement pages, kept separate from the index.html links.

$ sel.xpath('//a[contains(@href,"html")]')
... 
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atq/4243973984.html">Wicke'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html" class'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html">Recla'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/ata/index100.html" class="butt'>]

I would like to use the XPath matches function to select only links that match the regex [0-9]+.html.

$ sel.xpath('//a[matches(@href,"[0-9]+.html")]')
...
ValueError: Invalid XPath: //a[matches(@href,"[0-9]+.html")]

What's wrong? Thanks.

Recommended answer

matches is an XPath 2.0 function, and Scrapy only supports XPath 1.0, which has no built-in regular expression support. You'll have to extract all the links with a Scrapy selector and then do the regex filtering at the Python level rather than within the XPath.
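
A minimal sketch of the Python-level filtering described above. The `hrefs` list is a hypothetical stand-in built from the links shown in the question (plus one made-up non-ad link); in the scrapy shell the values could instead come from something like `sel.xpath('//a/@href').extract()`. The helper name `filter_ad_links` and the exact pattern are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical sample of href values; in the scrapy shell they could come from
# sel.xpath('//a/@href').extract()
hrefs = [
    "/mnh/atq/4243973984.html",
    "/mnh/atd/4257230057.html",
    "/ata/index100.html",
    "/about/help/",  # made-up non-ad link for illustration
]

# A path segment made only of digits, followed by a literal ".html" at the end.
# The leading "/" and the escaped dot are tighter than the question's
# [0-9]+.html, which would also match "index100.html".
AD_LINK = re.compile(r"/[0-9]+\.html$")


def filter_ad_links(links):
    """Return only the links whose last path segment is <digits>.html."""
    return [link for link in links if AD_LINK.search(link)]


ad_links = filter_ad_links(hrefs)
print(ad_links)
```

Scrapy selectors also expose a `.re()` method that applies a regular expression to the extracted strings, which may be a shorter way to do the same filtering inside the shell.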
