Remove comment tag but NOT content with BeautifulSoup
Question
I'm practicing web scraping with BeautifulSoup. I'm looking at NFL game data, specifically the "Team Stats" table on this page (https://www.pro-football-reference.com/boxscores/201809060phi.htm).
When I look at the HTML for the table, I see something like this:
<div class="section_heading">...</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_stats">
<table class="stats_table" id="team_stats" data-cols-to-freeze=1>
....
</table>
</div>
</div>
-->
Essentially, the HTML that gets rendered on the page is stored in the source as a comment. I can find the div for the table, but BeautifulSoup can't parse the table itself because it's all inside the comment.
Is there a good way around this so I can parse the table HTML with BeautifulSoup? I figured out how to extract the comment text, but I don't know of a good way to convert the resulting string into usable HTML. Alternatively, the comment tags could simply be removed, which I think would let the content be parsed as HTML, but I haven't found a good way to do that either.
Recommended answer
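from bs4 import BeautifulSoup, Comment

# soup is the BeautifulSoup object built from the downloaded page
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()  # detaches the comment from the tree and returns it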
This finds every comment in the document, and extract() pulls each one out of the tree and returns it. Each comment holds the text between the comment markers, so you can pass that text to BeautifulSoup again and extract the data inside it. Hope this works.
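As a rough sketch of that second step, the snippet below re-parses each comment's text and looks for the table inside it. The inline `html` string is a cut-down stand-in for the real page, the cell values are made-up placeholders, and `find_commented_table` is a hypothetical helper name that is not part of bs4.

```python
from bs4 import BeautifulSoup, Comment

# Stand-in for the downloaded page. Structure follows the question;
# the row contents are placeholder values, not real stats.
html = """
<div class="section_heading">Team Stats</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
  <div class="overthrow table_container" id="div_team_stats">
    <table class="stats_table" id="team_stats">
      <tr><th>Stat</th><th>Team A</th><th>Team B</th></tr>
      <tr><td>First Downs</td><td>1</td><td>2</td></tr>
    </table>
  </div>
</div>
-->
"""


def find_commented_table(soup, table_id):
    """Hypothetical helper: return the <table> with this id found inside
    any HTML comment in soup, or None if no comment contains it."""
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # A Comment is a string subclass, so it can be parsed directly.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
table = find_commented_table(soup, "team_stats")

# Flatten the table into a list of rows, each a list of cell texts.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Parsing each comment separately leaves the original tree untouched, which may be preferable to stripping the `<!--` and `-->` markers from the raw HTML when the page also contains ordinary comments that should stay hidden.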