beautifulsoup and invalid html document
Problem description
I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm.
I want to get countries and names at the beginning of the document.
Here is my code:
import urllib
import re
from bs4 import BeautifulSoup

url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm"
soup = BeautifulSoup(urllib.urlopen(url))
attendances_table = soup.find("table", {"width": 850})
print attendances_table  # this works, I see the whole table
print attendances_table.find_all("tr")
I get the following error:
AttributeError: 'NoneType' object has no attribute 'next_element'
I then tried the same solution as in this post (I know, me again :p):
beautifulsoup with an invalid html document
I replaced the line:
soup=BeautifulSoup(urllib.urlopen(url))
with:
return BeautifulSoup(html, 'html.parser')
Now if I do:
print attendances_table
I only get:
<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%">
<p><b><u>Belgium</u></b></p></td></tr></table>
What should I change?
Solution
Solved!
I just used another parser library, lxml.
Thank you Martijn Pieters for that!
soup = BeautifulSoup(urllib.urlopen(url), 'lxml')
lxml was the only library that worked for me!
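For anyone hitting the same thing, here is a minimal sketch of how one might compare parsers on a broken table before settling on one. It is written for Python 3 and does not fetch the real page: SAMPLE_HTML is a made-up stand-in with unclosed cells, and rows_per_parser is a hypothetical helper name, not part of BeautifulSoup. Parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the real page: table cells and rows are never closed.
SAMPLE_HTML = """
<table border="0" cellpadding="10" cellspacing="0" width="850">
<tr><td valign="TOP" width="42%"><p><b><u>Belgium</u></b></p>
<tr><td valign="TOP" width="42%"><p><b><u>Denmark</u></b></p>
</table>
"""


def rows_per_parser(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: number of <tr> rows found in the width=850 table}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed; skip it.
            continue
        table = soup.find("table", {"width": "850"})
        # A parser may fail to recover the table at all, so guard against None.
        counts[name] = len(table.find_all("tr")) if table is not None else 0
    return counts


if __name__ == "__main__":
    for name, count in rows_per_parser(SAMPLE_HTML).items():
        print(name, count)
```

If the counts differ between parsers, the one that recovers the most rows is probably the one to pass as the second argument to BeautifulSoup, as with 'lxml' above.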