How can I translate this XPath expression to BeautifulSoup?

Problem description

In answer to a previous question (http://stackoverflow.com/questions/1813921/how-to-search-a-html-page-for-an-item-in-a-given-list/1814616#1814616), several people suggested that I use BeautifulSoup for my project. I've been struggling with their documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

The above expression is from Scrapy. I'm trying to apply the regex re('\.a\w+') to td class altRow to get the links from there.
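For reference, here is a minimal sketch (my own assumption, not something from the original thread) of how that query might be written with BeautifulSoup: it scans every td with class="altRow", pulls the href of each link inside it, and keeps the ones matching the same /.a\w+ pattern. It ignores the [2] index in the XPath, and whether it finds anything depends on how well BeautifulSoup's parser copes with this page's markup. The URL is the one used later in the answer; the variable names are just placeholders.

import re
import urllib
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# collect hrefs from <a> tags inside <td class="altRow"> cells, keeping only
# those that match the pattern the Scrapy call filters on: /.a\w+
hrefs = []
for td in soup.findAll("td", {"class": "altRow"}):
    for a in td.findAll("a", href=True):
        m = re.match(r"/.a\w+", a["href"])
        if m:
            hrefs.append(m.group())

print hrefs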

I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.

Thanks for your help.

Edit:
I am looking at this page:

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

However, if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

For some reason, these search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".

Edit:

cobbal: It is still not working. But when I search this:

>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

it returns all the links with second character "a" but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.
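One way to narrow down where those links disappear (a suggested diagnostic, not something from the original thread) is to compare the raw HTML with what actually ends up in the parsed tree: if "/cabel" is present in the raw page but absent from str(soup), the parser is dropping or truncating the markup before it reaches those cells. The URL is the listing page mentioned in the answer below.

import urllib
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3

url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

print '/cabel' in html                               # is the link in the raw page at all?
print '/cabel' in str(soup)                          # did it survive parsing?
print len(soup.findAll("td", {"class": "altRow"}))   # how many altRow cells the parser kept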

Solution

I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
