Remove comment tag but NOT content with BeautifulSoup

Problem description

I'm practicing some web scraping using BeautifulSoup. Specifically, I'm looking at NFL game data, and more specifically the "Team Stats" table on this page (https://www.pro-football-reference.com/boxscores/201809060phi.htm).

When looking at the HTML for the table I see something like this:

<div class="section_heading">...</div>
<div class="placeholder"></div>
<!--
    <div class="table_outer_container">
        <div class="overthrow table_container" id="div_team_stats">
            <table class="stats_table" id="team_stats" data-cols-to-freeze=1>
                ....
            </table>
        </div>
    </div>
-->

Essentially, the HTML that is being rendered to the page is stored in the HTML as a comment, so I can find the div for the table but BeautifulSoup can't parse the table itself because it's all in the comment.

Is there a good way to get around this so I can parse the table HTML with BeautifulSoup? I figured out how to extract the comment text, but I don't know if there's a good way to convert the resulting string into usable HTML. Alternatively, the comment tags could simply be removed, which I think would let the content be parsed as HTML, but I haven't found a good way to do that either.

Solution

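from bs4 import BeautifulSoup, Comment

# `soup` is assumed to be an already-parsed BeautifulSoup document.
# Detach every HTML comment node from the tree; each node keeps its text.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()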

This pulls every comment out of the tree. Each extracted comment is a string containing the commented-out markup, so you can pass that string to BeautifulSoup again and extract the data inside it. Hope this works.
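To make the second step concrete, here is a minimal sketch of re-parsing the comment text. It is an illustration under stated assumptions, not the answerer's exact code: `SAMPLE_HTML` is a made-up stand-in for the downloaded page with placeholder values, `find_commented_table` is a hypothetical helper name, and the built-in `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the downloaded page (assumption: the real page is far
# larger, and the cell values here are placeholders, not real statistics).
SAMPLE_HTML = """
<div class="section_heading">Team Stats</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
  <div class="overthrow table_container" id="div_team_stats">
    <table class="stats_table" id="team_stats">
      <tr><th>Stat</th><th>Team A</th><th>Team B</th></tr>
      <tr><td>Example stat</td><td>1</td><td>2</td></tr>
    </table>
  </div>
</div>
-->
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id, searching HTML comments as well.

    Hypothetical helper: returns None when no matching table is found.
    """
    soup = BeautifulSoup(html, "html.parser")

    # The table may already be in the live markup.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table

    # Otherwise, parse the text of each comment as its own HTML fragment.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        fragment = BeautifulSoup(comment, "html.parser")
        table = fragment.find("table", id=table_id)
        if table is not None:
            return table
    return None


table = find_commented_table(SAMPLE_HTML, "team_stats")

# Flatten the table into a list of rows, each a list of cell strings.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Searching the live markup first means the same helper should keep working if the table is ever served uncommented. To keep the table inside the original tree instead, a comment node can be swapped for its parsed fragment with `comment.replace_with(fragment)`.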
