Why is BeautifulSoup not finding a specific table class?

Problem description

I am using Beautiful Soup to try to scrape the Commodities table from Oil-Price.net. I can find the first div, the table, the table body, and the rows of the table body, but there is a column in one of the rows that I can't find using Beautiful Soup. When I tell Python to print all the tables in that particular row, it doesn't show the one I want. This is my code:

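from urllib.request import urlopen  # Python 3; this was urllib2 in Python 2
from bs4 import BeautifulSoup

html = urlopen('http://oil-price.net').read()
# No parser is named here, so Beautiful Soup picks the best one installed.
soup = BeautifulSoup(html)

# Walk down from the outer div to the second row of the table body.
div = soup.find("div", {"id": "cntPos"})
table1 = div.find("table", {"class": "cntTb"})
tb1_body = table1.find("tbody")
tb1_rows = tb1_body.find_all("tr")
tb1_row = tb1_rows[1]
# This is the lookup that comes back empty.
td = tb1_row.find("td", {"class": "cntBoxGreyLnk"})
print(td)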

All it prints is None. I even tried printing each of the rows to see whether I could find the column manually, and found nothing. It shows other columns, but not the one I want.

Solution

The page uses broken HTML, and different parsers try to repair it in different ways. Install the lxml parser; it parses that page better:

>>> BeautifulSoup(html, 'html.parser').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
True
>>> BeautifulSoup(html, 'lxml').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
False
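
The comparison above can be wrapped in a small helper so the same lookup is tried under every parser you have. This is only a sketch: `compare_parsers` and the `locate` callback are names made up for illustration, not part of Beautiful Soup. The only library pieces it relies on are the `BeautifulSoup` constructor and the `FeatureNotFound` exception that bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(html, locate, parsers=("html.parser", "lxml", "html5lib")):
    """Run the same lookup under each parser and report what happened.

    `locate` is a hypothetical callback: it receives the soup and returns
    the element you are after, or None.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not available.
            results[name] = "not installed"
            continue
        try:
            found = locate(soup) is not None
        except (AttributeError, IndexError):
            # An earlier step in the chain returned None or too few rows.
            found = False
        results[name] = "found" if found else "missing"
    return results
```

For the page in the question, `locate` would be the chain from the session above, for example `lambda soup: soup.find("div", {"id": "cntPos"}).find("table", {"class": "cntTb"}).tbody.find_all("tr")[1].find("td", {"class": "cntBoxGreyLnk"})`. A parser reported as "missing" repaired the markup in a way that dropped or moved the cell.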

This doesn't mean that lxml will handle all broken HTML better than the other parser options. Also look at html5lib, which is a pure-Python implementation of the WHATWG HTML spec and therefore follows more closely how current browsers handle broken HTML.
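
Since the result depends on which parsers are installed, it can help to check that before parsing. The sketch below uses only the standard library; `installed_parsers`, `PARSER_MODULES`, and the preference order are assumptions for illustration. The strings "html5lib", "lxml", and "html.parser" are the parser names Beautiful Soup accepts as its second argument.

```python
from importlib.util import find_spec

# Beautiful Soup parser name -> module that must be importable for it to work.
PARSER_MODULES = {
    "html5lib": "html5lib",
    "lxml": "lxml",
    "html.parser": "html.parser",  # standard library, always present
}


def installed_parsers(preference=("html5lib", "lxml", "html.parser")):
    """Return the parser names from `preference` whose modules can be imported."""
    return [
        name
        for name in preference
        if find_spec(PARSER_MODULES[name]) is not None
    ]
```

You could then pass the first entry explicitly, as in `BeautifulSoup(html, installed_parsers()[0])`, so the choice of parser no longer depends silently on the environment. The order shown favors html5lib for browser-like repair; put lxml first if speed matters more.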
