Parsing an HTML page using BeautifulSoup

Views: 172  Posted: 2016/8/5 18:58:14  Tags: python html beautifulsoup


Question

I have started using BeautifulSoup to parse HTML. For example, for the page "http://en.wikipedia.org/wiki/PLCB1":

import sys
import urllib2

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

sys.setrecursionlimit(10000)

site= "http://en.wikipedia.org/wiki/PLCB1"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find('table', {'class':'infobox'})
#print table
rows = table.findAll("th")
for x in rows:
    print "x - ", x.string

I am getting None as output for some th elements that contain a link. Why is that?

Output:

x -  Phospholipase C, beta 1 (phosphoinositide-specific)
x -  Identifiers
x -  None
x -  External IDs
x -  None
x -  None
x -  Molecular function
x -  Cellular component
x -  Biological process
x -  RNA expression pattern
x -  Orthologs
x -  Species
x -  None
x -  None
x -  None
x -  RefSeq (mRNA)
x -  RefSeq (protein)
x -  Location (UCSC)
x -  None

For example, after Location (UCSC) there is one more th, which contains "PubMed search", but it appears as None. I want to know why this happens.

Second: is there a way to get each th and its corresponding td into a dictionary, so that the result is easy to parse?

Recommended answer

Element.string only contains a value if there is text directly in the element. Nested elements are not included, so a th that wraps part of its text in a link yields None.

If you are using BeautifulSoup 4, use Element.stripped_strings instead:

print ''.join(x.stripped_strings)
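
To make the difference concrete, here is a minimal sketch for Python 3 with BeautifulSoup 4 (the bs4 package). The markup is a made-up two-row table, not the real Wikipedia infobox, and the variable names plain and nested are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header with plain text, one that mixes a link and text.
html = """
<table>
  <tr><th>Identifiers</th></tr>
  <tr><th><a href="/wiki/PubMed">PubMed</a> search</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
plain, nested = soup.find_all("th")

# The first th has a single text child, so .string returns it.
print(plain.string)

# The second th has two children (an <a> tag and a text node), so .string is None.
print(nested.string)

# stripped_strings walks all nested text nodes and strips surrounding whitespace.
print(" ".join(nested.stripped_strings))
```

Joining with a space rather than an empty string keeps the link text and the trailing text from running together; which separator fits best depends on the page being parsed.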

For BeautifulSoup 3, you'll need to search for all text elements:

print ''.join([unicode(t).strip() for t in x.findAll(text=True)])

If you want to combine <th> and <td> elements into a dictionary, you'd have to loop over all <th> elements, use .findNextSibling() to locate the corresponding <td> element, and combine that with the .findAll(text=True) trick above to build yourself a dictionary:

info = {}
rows = table.findAll("th")
for headercell in rows:
    valuecell = headercell.findNextSibling('td')
    if valuecell is None:
        continue
    header = ''.join([unicode(t).strip() for t in headercell.findAll(text=True)])
    value = ''.join([unicode(t).strip() for t in valuecell.findAll(text=True)])
    info[header] = value
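
For readers on Python 3, the same loop might look roughly like this with BeautifulSoup 4, where the methods are spelled find_all and find_next_sibling. This is a sketch under assumptions: the table below is invented for illustration, and its values (including NM_000000) are placeholders, not data from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox-like table; the cell contents are placeholders.
html = """
<table class="infobox">
  <tr><th colspan="2">Identifiers</th></tr>
  <tr><th>Symbol</th><td><a href="#">PLCB1</a></td></tr>
  <tr><th><a href="#">RefSeq</a> (mRNA)</th><td>NM_000000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "infobox"})

info = {}
for header_cell in table.find_all("th"):
    # A header that spans the whole row has no td sibling; skip it.
    value_cell = header_cell.find_next_sibling("td")
    if value_cell is None:
        continue
    # Collect all nested text, so links inside a cell do not produce None.
    header = " ".join(header_cell.stripped_strings)
    value = " ".join(value_cell.stripped_strings)
    info[header] = value

print(info)
```

If two headers share the same text, the later one overwrites the earlier one in the dictionary, so a list of (header, value) pairs may be a safer container for tables with repeated labels.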
