Parsing an HTML page using BeautifulSoup
Question
I started working with BeautifulSoup for parsing HTML.
For example, for the site "http://en.wikipedia.org/wiki/PLCB1":
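# Python 2, BeautifulSoup 3
import sys
import urllib2
from BeautifulSoup import BeautifulSoup
# Raise the recursion limit so deeply nested pages can be parsed
sys.setrecursionlimit(10000)
site = "http://en.wikipedia.org/wiki/PLCB1"
# Send a browser-like User-Agent header with the request
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find('table', {'class': 'infobox'})
#print table
# Print the text of every header cell in the infobox
rows = table.findAll("th")
for x in rows:
    print "x - ", x.string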
For some th elements that contain a URL, I am getting None as output. Why is that?
Output:
x - Phospholipase C, beta 1 (phosphoinositide-specific)
x - Identifiers
x - None
x - External IDs
x - None
x - None
x - Molecular function
x - Cellular component
x - Biological process
x - RNA expression pattern
x - Orthologs
x - Species
x - None
x - None
x - None
x - RefSeq (mRNA)
x - RefSeq (protein)
x - Location (UCSC)
x - None
For example, after Location there is one more th, which contains "pubmed search" but appears as None. I want to know why this is happening.
Second: is there a way to get each th and its respective td into a dictionary so that it becomes easy to parse?
Recommended answer
Element.string only contains a value if there is text directly in the element. Nested elements are not included, so a th that mixes text with a link or other child tags yields None.
If you are using BeautifulSoup 4, use Element.stripped_strings instead:
print ''.join(x.stripped_strings)
For BeautifulSoup 3, you'll need to search for all text elements:
print ''.join([unicode(t).strip() for t in x.findAll(text=True)])
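The difference between .string and .stripped_strings can be seen on a small inline document. The following is a minimal sketch, assuming Python 3 and BeautifulSoup 4 (the bs4 package) with the built-in "html.parser" backend. The markup is a hypothetical stand-in for two infobox header cells, not the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header cell with plain text, one mixing a link and text.
html = """
<table class="infobox">
  <tr><th>Identifiers</th><td>PLCB1</td></tr>
  <tr><th><a href="/wiki/PubMed">PubMed</a> search</th><td><a href="#">[1]</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
plain, mixed = soup.find_all("th")

# A cell whose only child is a text node exposes that text as .string.
plain_string = plain.string

# A cell with more than one child (here a link plus trailing text) has no single string.
mixed_string = mixed.string

# stripped_strings walks every nested text node and strips surrounding whitespace.
mixed_text = " ".join(mixed.stripped_strings)

print(repr(plain_string))
print(repr(mixed_string))
print(repr(mixed_text))
```

Joining with a space rather than an empty string is a choice made for this sketch: it keeps "PubMed" and "search" apart, whereas ''.join would run the pieces together.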
If you want to combine <th> and <td> elements into a dictionary, you'd have to loop over all <th> elements, then use .findNextSibling() to locate the corresponding <td> element, and combine that with the above .findAll(text=True) trick to build yourself a dictionary:
info = {}
rows = table.findAll("th")
for headercell in rows:
    valuecell = headercell.findNextSibling('td')
    if valuecell is None:
        continue
    header = ''.join([unicode(t).strip() for t in headercell.findAll(text=True)])
    value = ''.join([unicode(t).strip() for t in valuecell.findAll(text=True)])
    info[header] = value
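For readers on BeautifulSoup 4, the same loop can be sketched with find_next_sibling() and stripped_strings. This is an illustrative sketch rather than the answer's own code: it assumes Python 3 and the bs4 package, the markup and cell values are hypothetical, and cell_text is a helper name introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox-like markup; the cell values are placeholders.
html = """
<table class="infobox">
  <tr><th colspan="2">Identifiers</th></tr>
  <tr><th>Symbol</th><td><a href="#">PLCB1</a></td></tr>
  <tr><th><a href="#">RefSeq</a> (mRNA)</th><td>example-value</td></tr>
</table>
"""

def cell_text(cell):
    """Join all nested text fragments of a cell (hypothetical helper)."""
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="infobox")

info = {}
for header_cell in table.find_all("th"):
    # The value is the next <td> sibling within the same row, if any.
    value_cell = header_cell.find_next_sibling("td")
    if value_cell is None:
        # Section headers that span the whole row have no value cell.
        continue
    info[cell_text(header_cell)] = cell_text(value_cell)

print(info)
```

Header cells that span a whole row, such as "Identifiers" in this sample, are skipped because they have no td sibling. If two header cells share the same text, the later one overwrites the earlier entry in the dictionary.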