How can I translate this XPath expression to BeautifulSoup?

Problem Description

In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with its documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.aw+')

The above expression is from Scrapy. I'm trying to apply the regex re('/.aw+') to the td elements with class altRow to get the links from there.
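For reference, a rough BeautifulSoup counterpart to that XPath is sketched below. This is only an approximation under assumptions: it uses the BeautifulSoup 3 findAll API that appears later in this question, it mimics the XPath [2] step by taking every second matching cell, and it presumes the soup object from the session below actually contains those rows.

import re

# sketch: take every second altRow cell (a rough stand-in for the
# XPath [2] index), then keep hrefs matching the same regex Scrapy used
cells = soup.findAll('td', {'class': 'altRow'})
for td in cells[1::2]:
    a = td.find('a', href=re.compile(r'/.aw+'))
    if a:
        print(a['href'])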

I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.

Thanks for your help.

Edit: I am looking at this page:

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a> 

Yet, if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a> 

For some reason, these search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.aw+') catches "/cabel".

Edit: cobbal: It is still not working. But when I search this:

>>>soup.findAll(href=re.compile(r'/.aw+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

it returns all the links whose second character is "a", but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.
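One hypothesis worth testing is that the parser silently truncates the tree when it hits malformed markup (older BeautifulSoup releases were known to behave this way), so everything after the breakage is invisible to findAll even though a plain regex over the raw source still matches it. A quick diagnostic sketch, assuming html is a hypothetical variable holding the raw page source that soup was built from:

import re

# compare how many altRow cells survived parsing against how many
# appear in the raw markup; a large gap suggests a truncated parse tree
parsed = len(soup.findAll('td', {'class': 'altRow'}))
raw = len(re.findall(r'class=["\']altRow["\']', html))
print('parsed %d of %d altRow cells' % (parsed, raw))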

Solution

I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
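As a usage note, scanString returns an iterator of (tokens, start, end) triples, so the same pattern can feed a dictionary instead of printing; a small variation under the same assumptions as the code above:

# build a name -> href mapping from the same pyparsing pattern
refs = {}
for ref, s, e in patt.scanString(html):
    refs[ref.text] = ref.a.href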
