Why is BeautifulSoup not finding a specific table class?
Problem description
I am using Beautiful Soup to try to scrape the Commodities table from Oil-Price.net. I can find the first div, the table, the table body, and the rows of the table body. But there is a column in one of the rows that I can't find using Beautiful Soup. When I tell Python to print all the tables in that particular row, it doesn't show the one I want. This is my code:
from urllib2 import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://oil-price.net').read()
soup = BeautifulSoup(html)
div = soup.find("div",{"id":"cntPos"})
table1 = div.find("table",{"class":"cntTb"})
tb1_body = table1.find("tbody")
tb1_rows = tb1_body.find_all("tr")
tb1_row = tb1_rows[1]
td = tb1_row.find("td",{"class":"cntBoxGreyLnk"})
print td
All it prints is None. I even tried printing each of the rows to see if I could find the column manually, and found nothing. It shows other columns, but not the one I want.
The page uses broken HTML, and different parsers will try to repair it differently. Install the lxml parser; it parses that page better:
>>> BeautifulSoup(html, 'html.parser').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
True
>>> BeautifulSoup(html, 'lxml').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
False
This doesn't mean that lxml will handle all broken HTML better than the other parser options. Also look at html5lib, a pure-Python implementation of the WHATWG HTML spec, which therefore more closely follows how current browser implementations handle broken HTML.
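To check which parser copes best with a given page, it can help to run the same lookup under each installed parser and compare the results. The following is a minimal sketch in Python 3, not part of the original answer: the helper name find_with_each_parser and the sample markup "<a></p>" are illustrative assumptions, and any parser that is not installed is skipped rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_with_each_parser(html, finder,
                          parsers=("html.parser", "lxml", "html5lib")):
    """Apply finder(soup) under each parser and collect the results.

    Hypothetical helper for illustration. Returns a dict that maps the
    parser name to whatever finder returned. Parsers that are not
    installed raise FeatureNotFound and are skipped.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not available in the environment
        results[name] = finder(soup)
    return results


# Assumed sample of broken markup: a stray closing tag with no opener.
broken = "<a></p>"

# Using str as the finder shows how each parser repaired the document.
for parser_name, repaired in find_with_each_parser(broken, str).items():
    print(parser_name, "->", repaired)
```

For the page in the question, the finder could instead be something like lambda soup: soup.find("td", {"class": "cntBoxGreyLnk"}) is not None, which reports for each parser whether the cell survives that parser's repair of the markup.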