Remove all style, scripts, and html tags from an html page

Views: 37 Published: 2021/12/23 20:40:58 Tags: python html beautifulsoup


Question

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
    for script in soup(["script"]): 
        script.extract()
    text = soup.get_text()
    return text
testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print (cleaned)

This works for removing the script.

Recommended answer

It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on double spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
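
For comparison, here is a minimal sketch of the same idea using only the standard library's html.parser, in case installing BeautifulSoup is not an option. This is not part of the original answer: the names TextExtractor and clean_me_stdlib are made up for illustration. It also differs from the answer in one way: it puts every text node on its own line, whereas get_text() concatenates them, so inline markup such as <b> would split a sentence across lines.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def clean_me_stdlib(html):
    """Hypothetical stdlib-only variant of the cleanMe function above."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: one line per text node (get_text() would concatenate them).
    text = "\n".join(parser.parts)
    # Same whitespace cleanup as in the answer.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


testhtml = (
    "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title>"
    "<style>.call {font-family:Arial;}</style><script>getit</script>"
    "<body>I need this text captured<h1>And this</h1></body>"
)
print(clean_me_stdlib(testhtml))
# THIS IS AN EXAMPLE
# I need this text captured
# And this
```

For real-world, badly formed HTML, the BeautifulSoup version above is likely the more forgiving choice.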
