Regex in lxml for Python

Question

I'm having trouble using a regex inside an XPath expression. My goal is to download the HTML of the main page, as well as the contents of every hyperlink on it. However, the program throws exceptions because some of the href values do not point to anything that can be fetched (e.g. '//:javascript' or '#'). How can I use a regex in XPath? Is there an easier way to exclude non-absolute hrefs?

from lxml import html
import requests
main_pg = requests.get("http://gazetaolekma.ru/")
with open("Sample.html","w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[re:findall("^(http|https|ftp):.*")]/@href')
for href in hrefs:
    link_page = requests.get(href)
    with open("%s.html"%href[0:9], "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_page.text)

Recommended answer

With XPath 1.0 you can always use or in the predicate:

hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')
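If a real regex is still wanted, lxml's XPath engine also accepts the EXSLT regular-expressions functions once their namespace is bound to a prefix. The sketch below runs the starts-with() predicate and an EXSLT re:test() predicate against the same document so the results can be compared. The SAMPLE markup and the example.com URLs are made up for illustration and stand in for the downloaded page. The pattern is one reasonable reading of the question's intent, not the only possible one.

```python
from lxml import html

# EXSLT regular-expressions namespace; binding it to the "re" prefix
# makes re:test() available inside lxml XPath expressions.
EXSLT_RE_NS = "http://exslt.org/regular-expressions"

# Hypothetical markup standing in for the downloaded main page.
SAMPLE = """<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="https://example.com/b">absolute https</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script pseudo-link</a>
  <a href="/relative/path">relative</a>
  <a>no href at all</a>
</body></html>"""

tree = html.fromstring(SAMPLE)

# Plain XPath 1.0: no regex, just starts-with() combined with "or".
hrefs_plain = tree.xpath(
    '//a/@href[starts-with(., "http") or starts-with(., "ftp")]'
)

# EXSLT regex: keep only <a> elements whose href matches the pattern.
hrefs_regex = tree.xpath(
    '//a[re:test(@href, "^(https?|ftp)://")]/@href',
    namespaces={"re": EXSLT_RE_NS},
)

print(hrefs_plain)
print(hrefs_regex)
```

Both queries skip '#', the javascript: pseudo-link, the relative path, and the anchor without an href. The starts-with() form is slightly looser, since any value that merely begins with "http" or "ftp" passes, while the regex form also requires the "://" that follows the scheme.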
