BeautifulSoup lost nodes
Problem description
I am using Python and BeautifulSoup to parse HTML data and extract the <p> tags from pages linked in RSS feeds. However, some URLs cause problems because the parsed soup object does not include all nodes of the document.
For example, I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm
After comparing the parsed object with the page's source code, I noticed that all nodes after <ul class="nextgen-left"> are missing.
Here is how I parse the documents:
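# Python 2 code (cookielib and urllib2 are Python 2 modules)
import cookielib
import urllib2
from bs4 import BeautifulSoup as bs
url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'
# Keep cookies across redirects while fetching the page
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)
response = opener.open(request)
# Parse the response with the lxml parser
soup = bs(response,'lxml')
print soup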
The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5')
>>> soup.find('div', id='story-body') is not None
True
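To check quickly which parsers keep a given element, the same comparison can be wrapped in a small helper. This is only a minimal sketch: the function name find_with_each_parser and the sample markup are made up for illustration, and the sample is well-formed, so it will not reproduce the lost nodes seen on the real page. Parsers that are not installed are reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_with_each_parser(html, tag, attrs,
                          parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: found}, where found is True/False,
    or None when that parser is not installed.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable
            results[name] = None
            continue
        results[name] = soup.find(tag, attrs=attrs) is not None
    return results


# Illustrative, well-formed markup (an assumption, not the real page)
sample = (
    '<html><body><ul class="nextgen-left"><li>item</li></ul>'
    '<div id="story-body"><p>text</p></div></body></html>'
)

report = find_with_each_parser(sample, "div", {"id": "story-body"})
for parser_name, found in report.items():
    print(parser_name, found)
```

On a real page you would pass r.text instead of sample. If one parser reports False while another reports True, the markup is probably being repaired differently by each parser, which is the situation described above.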