beautifulsoup and invalid html document


Problem description


I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm. I want to get countries and names at the beginning of the document.

Here is my code:

import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
soup=BeautifulSoup(urllib.urlopen(url))
attendances_table=soup.find("table", {"width":850})
print attendances_table #this works, I see the whole table
print attendances_table.find_all("tr")

I get the following error:

AttributeError: 'NoneType' object has no attribute 'next_element'

I then tried the solution from this post (I know, it's me again :p): beautifulsoup with an invalid html document

I replaced the line:

soup=BeautifulSoup(urllib.urlopen(url))

with:

return BeautifulSoup(html, 'html.parser')

Now if I do:

print attendances_table

I only get:

<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b></p></td></tr></table>

What should I change?

Solution

Solved!

I just used another parser library, lxml. Thank you Martijn Pieters for that!

soup = BeautifulSoup(urllib.urlopen(url), 'lxml')
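
The line above is Python 2 (`urllib.urlopen`). For readers on Python 3, here is a minimal sketch of the same idea. The helper names `find_attendance_rows` and `fetch_html` are made up for illustration and are not part of Beautiful Soup. The default parser assumes the `lxml` package is installed, and the parser is a parameter so that another installed parser can be tried instead.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the raw bytes of a page; Beautiful Soup guesses the encoding."""
    with urlopen(url) as response:
        return response.read()


def find_attendance_rows(html, parser="lxml"):
    """Return the <tr> tags of the first table with width="850".

    Returns an empty list when no such table is found, instead of
    failing on a None result from find().
    """
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", {"width": "850"})
    if table is None:
        return []
    return table.find_all("tr")
```

With the URL from the question, the call would look like `find_attendance_rows(fetch_html(url))`. Passing the parser name explicitly also avoids relying on whichever parser Beautiful Soup picks by default on a given machine.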

lxml was the only library that worked for me!
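
Which parser copes best depends on the document, because each parser repairs invalid markup in its own way. One way to check is to feed the same markup to every parser that is installed and compare how many rows each one finds. The sketch below does that; `rows_per_parser` and `PARSERS` are hypothetical names chosen for this example, and parsers that are not installed are reported as `None` rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parser names accepted by Beautiful Soup; lxml and html5lib are optional installs.
PARSERS = ("html.parser", "lxml", "html5lib")


def rows_per_parser(html, parsers=PARSERS):
    """Map each parser name to the number of <tr> tags it yields.

    A parser that is not installed maps to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts
```

If the counts differ for the same input, the markup is being repaired differently by each parser, which matches the symptom in the question where `html.parser` returned only the first row of the table.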
