Select all anchor tags with an href attribute that contains one of multiple values via XPath in lxml / Python


Question

I need to automatically scan a large number of HTML documents for ad banners that are wrapped in an anchor tag, e.g.:

<a href="http://ad_network.com/abc.html">
    <img src="ad_banner.jpg">
</a>

As an XPath newbie, I can select such anchors via lxml like so:

import lxml.html

text = '''
    <a href="http://ad_network.com/abc.html">
        <img src="ad_banner.jpg">
    </a>'''

root = lxml.html.fromstring(text)
# Anchors whose href contains either domain and that wrap an <img>
print(root.xpath('//a[contains(@href, "ad_network.") or contains(@href, "other_ad_network.")][descendant::img]'))

In the example I check two different domains: "ad_network." and "other_ad_network.". However, there are more than 25 domains to check, and the XPath expression would get terribly long if I chained all those contains() tests with "or". I also fear the expression would be rather inefficient in terms of CPU resources. Is there some syntax for checking multiple "contains" values at once?

I could also get the links in question with a regex in a single line of code. Yet, even though the HTML is normalized by lxml, regex never seems to be a good choice for this kind of work. Any help appreciated!

Recommended answer

It might not be that bad just to do a bunch of 'or's. Build the XPath with Python so that you don't get writer's cramp, then precompile it. The XPath evaluation itself happens in libxml2 and should be fast.

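from lxml import etree
sites = ['aaa', 'bbb']
# Quote each site so XPath reads it as a string literal, not an element name
contains = ' or '.join('contains(@href, "%s")' % site for site in sites)
# Compile once, then reuse anchor_xpath(root) for every document
anchor_xpath = etree.XPath('//a[%s][descendant::img]' % contains)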
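
As a variation on the snippet above, here is a minimal sketch that passes the domains as XPath variables instead of pasting them into the expression text, so a domain containing a quote character cannot break the expression. The helper names build_ad_xpath and find_ad_anchors and the variable names d0, d1, ... are made up for illustration; the sketch relies on lxml accepting XPath variables as keyword arguments when a compiled etree.XPath object is called.

```python
def build_ad_xpath(domains):
    """Return (expression, variables) for anchors that wrap an <img>
    and whose href contains at least one of the given domains.

    Hypothetical helper: each domain becomes an XPath variable ($d0, $d1, ...)
    rather than a quoted literal inside the expression.
    """
    if not domains:
        raise ValueError("need at least one domain")
    variables = {"d%d" % i: domain for i, domain in enumerate(domains)}
    tests = " or ".join("contains(@href, $%s)" % name for name in variables)
    return "//a[%s][descendant::img]" % tests, variables


def find_ad_anchors(html, domains):
    """Parse one HTML string and return the matching <a> elements."""
    # Imported lazily so build_ad_xpath stays usable without lxml installed.
    import lxml.html
    from lxml import etree

    expression, variables = build_ad_xpath(domains)
    finder = etree.XPath(expression)   # compile the expression once
    root = lxml.html.fromstring(html)
    return finder(root, **variables)   # variables are bound at call time
```

When scanning many documents, it would make sense to call build_ad_xpath and etree.XPath once up front and reuse the compiled object for each parsed document, rather than recompiling per file as find_ad_anchors does for brevity.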
