Beautifulsoup lost nodes


Problem description


I am using Python and Beautifulsoup to parse HTML-Data and get p-tags out of RSS-Feeds. However, some urls cause problems because the parsed soup-object does not include all nodes of the document.

For example I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm

But after comparing the parsed object with the page's source code, I noticed that all nodes after ul class="nextgen-left" are missing.

Here is how I parse the documents:

# Python 2 code; cookielib and urllib2 became http.cookiejar and
# urllib.request in Python 3.
import cookielib
import urllib2

from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)

response = opener.open(request)

soup = bs(response, 'lxml')
print soup

Solution

The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5lib')
>>> soup.find('div', id='story-body') is not None
True
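Once the page parses with a lenient parser, extracting the p tags (the asker's original goal) is straightforward. A minimal sketch, using a small hypothetical HTML snippet standing in for the feed page so it runs offline; it uses the built-in `html.parser` only to avoid extra dependencies here, whereas for the real page you would pass `'html5lib'` as shown above:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the feed page; the real page keeps its
# article paragraphs inside a <div id="story-body">.
html = """
<html><body>
  <ul class="nextgen-left"><li>nav item</li></ul>
  <div id="story-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="story-body")
paragraphs = [p.get_text() for p in story.find_all("p")]
print(paragraphs)  # → ['First paragraph.', 'Second paragraph.']
```

If the parser silently drops the nodes after the `ul`, `story` comes back as `None`, which is exactly the symptom described in the question, so a quick `soup.find(...) is not None` check is a useful way to tell whether the chosen parser handled the page.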
