Can we use XPath with BeautifulSoup?


Question

I am using BeautifulSoup to scrape a URL, and I have the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

In the code above, we can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, could anyone provide some example code?

Recommended answer

No, BeautifulSoup by itself does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements:

import urllib2
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# xpathselector is a placeholder for your own XPath expression string.
tree.xpath(xpathselector)
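To try the same approach without a network request, here is a minimal Python 3 sketch that parses an inline HTML string. The sample markup and the XPath expression are made up for illustration; they are assumptions, not part of the original page.

```python
from lxml import etree

# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<html><body>
  <table id="foobar">
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td></tr>
  </table>
</body></html>
"""

# Parse with lxml's forgiving HTML parser, as in the answer above.
root = etree.fromstring(sample_html, etree.HTMLParser())

# Select every <td> whose class attribute is exactly "empformbody".
cells = root.xpath('//td[@class="empformbody"]')
cell_texts = [cell.text for cell in cells]
print(cell_texts)
```

Note that `@class="empformbody"` only matches when the attribute is exactly that string; a cell with several classes would need a different expression.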

Of possible interest to you is the CSS selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells; printing the text is just an example.
    print(elem.text)

Coming full circle: BeautifulSoup itself does have pretty decent CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells; printing the text is just an example.
    print(cell.get_text())
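As a self-contained illustration of that selector, here is a short Python 3 sketch. It assumes the bs4 package (BeautifulSoup 4), since `select()` is not available in the older `BeautifulSoup` module imported in the question, and the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one of which has id="foobar".
sample_html = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
<table id="other"><tr><td class="empformbody">Carol</td></tr></table>
"""

# Use the standard-library parser so no extra parser package is needed.
soup = BeautifulSoup(sample_html, "html.parser")

# Only cells inside table#foobar with class "empformbody" are selected.
selected = soup.select('table#foobar td.empformbody')
cell_texts = [cell.get_text(strip=True) for cell in selected]
print(cell_texts)
```

The `table#foobar` prefix restricts the match to one table, so the cell in the second table is not returned.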
