beautifulsoup and invalid html document

Views: 122 | Posted: 2018/6/21 12:43:07 | Tags: python html parsing html-parsing beautifulsoup


Problem description

I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm.
I want to get countries and names at the beginning of the document.

Here is my code:

    import urllib
    import re
    from bs4 import BeautifulSoup

    url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
    soup = BeautifulSoup(urllib.urlopen(url))
    attendances_table = soup.find("table", {"width": 850})
    print attendances_table  # this works, I see the whole table
    print attendances_table.find_all("tr")

I get the following error:

    AttributeError: 'NoneType' object has no attribute 'next_element'

I then tried the same solution as in this post (I know, it's me again :p):
beautifulsoup with an invalid html document

I replaced the line:

    soup = BeautifulSoup(urllib.urlopen(url))

with:

    return BeautifulSoup(html, 'html.parser')

Now if I do:

    print attendances_table

I only get:

    <table border="0" cellpadding="10" cellspacing="0" width="850">
    <tr><td valign="TOP" width="42%">
    <p><b><u>Belgium</u></b></p></td></tr></table>

What should I change?
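
The truncated table above is consistent with how Beautiful Soup behaves on invalid markup: each parser it can use (html.parser, lxml, html5lib) repairs broken HTML differently, so the same page can yield different trees. The following is a minimal sketch, not taken from the original post; the compare_parsers helper and the BROKEN sample string are made up for illustration, and any parser that is not installed is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: a closing </p> that was never opened.
BROKEN = "<a></p>"


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree} for every installed parser."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
    return results


for name, tree in compare_parsers(BROKEN).items():
    print(name, "->", tree)
```

Running the same comparison on a saved copy of the page and counting the tr elements each parser finds is one way to see which parser keeps the whole table.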
Solution
Solved!

I just used another parser library, lxml.
Thank you Martijn Pieters for that!

    soup = BeautifulSoup(urllib.urlopen(url), 'lxml')

lxml was the only library that worked for me!
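
The code in this thread is Python 2 (print statements, urllib.urlopen). A rough Python 3 equivalent of the fix might look like the sketch below. The helper names attendance_rows and fetch_attendance_rows are assumptions for illustration, lxml has to be installed separately, and the URL from the question may no longer resolve.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def attendance_rows(markup, parser="lxml"):
    """Return the <tr> elements of the 850-wide table, or [] if it is absent."""
    # Naming the parser explicitly avoids depending on whichever one
    # Beautiful Soup would otherwise pick on this machine.
    soup = BeautifulSoup(markup, parser)
    table = soup.find("table", {"width": "850"})
    if table is None:
        return []
    return table.find_all("tr")


def fetch_attendance_rows(url, parser="lxml"):
    """Download url and extract the attendance rows (needs network access)."""
    with urlopen(url) as response:
        return attendance_rows(response.read(), parser)
```

Calling fetch_attendance_rows with the URL from the question and printing the length of the result shows how many rows survived parsing; passing parser="html.parser" or parser="html5lib" makes it easy to compare parsers on the same page.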
