Python - beautifulsoup - how to deal with missing closing tags
Problem description
I would like to scrape the table from HTML code using BeautifulSoup. A snippet of the HTML is shown below. When using table.findAll('tr') I get the entire table and not only the rows (probably because the closing tags are missing from the HTML code?).
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my Python code to show what I am struggling with:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
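The behaviour can be reproduced with a minimal fragment (the fragment below is a made-up example, not the question's data): with html.parser, the unclosed <TD>/<TR> tags are not repaired, so each new row ends up nested inside the previous one and the first row's text spans the whole table.

```python
from bs4 import BeautifulSoup

# Minimal fragment with the same missing </td>/</tr> closing tags
# as the table in the question (hypothetical example data)
fragment = "<table><tr><td>a<td>b<tr><td>c<td>d</table>"

soup = BeautifulSoup(fragment, "html.parser")
rows = soup.find("table").find_all("tr")

# html.parser does not close the open tags, so the second row is
# nested inside the first and rows[0].text covers the entire table
print(len(rows))
print(rows[0].text)
```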
Recommended answer
As stated in the documentation, html5lib parses the document the same way a web browser does (and lxml behaves similarly in this case): it will try to fix your document tree by adding and closing tags where needed.
In your example I've used lxml as the parser, and it gave the following result:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
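Once lxml has repaired the row structure, the cells can also be collected into one dict per row, using the first row as the column headers. A minimal sketch, using an abbreviated three-column version of the question's table (the closing tags are still missing, exactly as in the original HTML):

```python
from bs4 import BeautifulSoup

# Abbreviated three-column version of the question's table;
# the </TD>/</TR> closing tags are missing, as in the original
data = """<TABLE COLS=3 BORDER=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>PZN</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD>12516714
</TABLE>"""

soup = BeautifulSoup(data, "lxml")
rows = soup.find("table").find_all("tr")

# The first row carries the column headers, the rest the data
headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in rows[1:]
]
print(records)
```

Each data row now comes out as one dict keyed by the header names, which is usually easier to work with than a flat stream of cell text.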
Note that lxml added the html and body tags because they weren't present in the source (it will try to create a well-formed document, as stated previously).
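Both repairs can be seen on a small fragment (a made-up example, not the question's data): lxml wraps it in html/body and closes the implicit <td>/<tr> tags, so the rows come out properly separated.

```python
from bs4 import BeautifulSoup

# Made-up fragment with missing </td>/</tr> closing tags
fragment = "<table><tr><td>a<td>b<tr><td>c<td>d</table>"
soup = BeautifulSoup(fragment, "lxml")

# lxml added the html and body wrappers that the fragment lacked
print(soup.html is not None, soup.body is not None)

# ...and closed the implicit tags, so each row holds only its own cells
rows = soup.find("table").find_all("tr")
print(len(rows))
print([td.get_text() for td in rows[0].find_all("td")])
```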