Python code to remove HTML tags from a string

Views: 1572 Published: 2018/6/13 17:34:11 Tags: python html xml string parsing

Question

I have a text like this:

text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>"""

Using pure Python, with no external module, I want to have this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using builtins or the standard library, for Python 2.6+.

How can I do that?

Recommended answer

Using a regex

Using a regex, you can strip everything between < and >:

import re

def cleanhtml(raw_html):
    # Non-greedy match of anything between < and >
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
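The regex above removes the tags but leaves the whitespace between them, so the result is not yet the single line shown in the question. A minimal sketch of one way to get there, assuming Python 3 and assuming that collapsing whitespace is acceptable; the name remove_tags comes from the question, while the whitespace handling is an addition and not part of the original answer:

```python
import re

# Same non-greedy pattern as in the answer above
TAG_RE = re.compile(r'<.*?>')

def remove_tags(raw_html):
    # Replace each tag with a space so neighbouring text nodes do not fuse
    no_tags = TAG_RE.sub(' ', raw_html)
    # Collapse runs of whitespace and trim both ends
    return ' '.join(no_tags.split())

text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>"""
print(remove_tags(text))  # Title A long text........ a link
```

Note that this pattern does not understand HTML: a ">" inside an attribute value ends the match early, and because "." does not match newlines by default, a tag split across lines is left untouched.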

Using BeautifulSoup

You could also use the additional BeautifulSoup package to extract all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers: it is much more robust than the default one, 'html.parser', which is the parser available without an additional install.

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But this does not spare you from using external libraries, so I recommend the first solution.
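If the regex proves too brittle, the standard library's own HTML parser also meets the no-external-module constraint. A minimal sketch, assuming Python 3; the names TextExtractor and strip_tags are hypothetical helpers introduced for illustration and do not come from the original answer. On Python 2 the class lives in the HTMLParser module and its constructor takes no convert_charrefs argument, so the code would need adapting there.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.chunks.append(data)

def strip_tags(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

print(strip_tags('<p>Fish &amp; <b>chips</b></p>'))  # Fish & chips
```

Unlike the regex, this decodes character references while it strips the tags. Whitespace between tags is kept as-is, so it may still need collapsing afterwards.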
