Extract href values containing a keyword using XPath in Python


Question

I know variants of this question have been asked a number of times, but I haven't been able to crack it and get what I want.

I have a website which has a few tables in it. The table of interest contains a column where each row contains the word Text hyperlinked to a different page. Here is a specific example from the first row of the page fetched in the code below:

<a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a>

This is the general pattern:

<a href="_something_something-blah-blah.txt">Text</a>

This is what I'm doing right now:

import requests  
import lxml.html as lh
page = requests.get("http://www.wildwinds.com/coins/ric/constantine/t.html")
doc = lh.fromstring(page.content)
href_elements = doc.xpath('/html/body/center/table/tbody/tr/td/a/@href')
print(href_elements)

The desired result is a list of items that look like this: _something_something-blah-blah.txt. What I get instead is an empty list.

Since the page has other links I'm not interested in, I also want to modify the query to grab only the href attributes whose values contain .txt.

Any help you can provide is much appreciated!

Recommended answer

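Try something like the following. The absolute path in the question most likely returns nothing because of its tbody step: browsers add tbody to the DOM they display, but it is often missing from the raw HTML that lxml parses. A relative path avoids that step, and contains(@href, ".txt") keeps only the .txt links: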

# A bare string predicate such as ["Text"] is always true in XPath 1.0, so it is omitted here
href_elements = doc.xpath('//center//table//a[contains(@href,".txt")]/@href')
for href in href_elements:
    print(href)

Output:

_alexandria_RIC_VI_099b_K-AP.txt
_alexandria_RIC_VI_100.txt
_alexandria_RIC_VI_136.txt
_alexandria_RIC_VI_156.txt
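
The behaviour can also be checked offline. The following is a minimal sketch, assuming lxml is installed; SAMPLE_HTML is a made-up stand-in for the real page (the .jpg and readme.txt links are hypothetical), written without a tbody element the way raw HTML often is. It compares the absolute path from the question with a relative path filtered by contains(), and adds a stricter variant that also requires the link text to be exactly Text.

```python
import lxml.html as lh

# Hypothetical stand-in for the real page. Note: the markup has no <tbody>.
SAMPLE_HTML = """
<html><body><center>
<table>
  <tr>
    <td><a href="_alexandria_RIC_VI_099b_K-AP.txt">Text</a></td>
    <td><a href="_alexandria_RIC_VI_099b_K-AP.jpg">Image</a></td>
  </tr>
  <tr>
    <td><a href="_alexandria_RIC_VI_100.txt">Text</a></td>
    <td><a href="readme.txt">Notes</a></td>
  </tr>
</table>
</center></body></html>
"""

doc = lh.fromstring(SAMPLE_HTML)

# Absolute path with a tbody step: lxml does not add tbody, so nothing matches.
with_tbody = doc.xpath('/html/body/center/table/tbody/tr/td/a/@href')

# Relative path, keeping only hrefs whose value contains ".txt".
txt_hrefs = doc.xpath('//center//table//a[contains(@href, ".txt")]/@href')

# Stricter variant: the link text must also be exactly "Text".
text_links = doc.xpath(
    '//center//table//a[contains(@href, ".txt")][text()="Text"]/@href'
)

print(with_tbody)
print(txt_hrefs)
print(text_links)
```

On this sample the first query returns an empty list, the second returns all three .txt links, and the third drops readme.txt because its link text is Notes. Whether the extra text() predicate is needed on the real page depends on whether it contains other .txt links.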
