Strip HTML from strings in Python

Views: 117 Published: 2018/6/13 9:27:50 Tags: python html

Question

    from mechanize import Browser

    br = Browser()
    br.open('http://somewebpage')
    html = br.response().readlines()
    for line in html:
        print line

When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it will only print 'some text', '<b>hello</b>' prints 'hello', etc. How would one go about doing this?

Solution

I always used this function to strip HTML tags, as it requires only the Python stdlib:

On Python 2

    from HTMLParser import HTMLParser

    class MLStripper(HTMLParser):
        def __init__(self):
            self.reset()
            self.fed = []  # text chunks collected so far

        def handle_data(self, d):
            # Called for each run of text between tags
            self.fed.append(d)

        def get_data(self):
            return ''.join(self.fed)

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

For Python 3

    from html.parser import HTMLParser

    class MLStripper(HTMLParser):
        def __init__(self):
            self.reset()
            self.strict = False
            self.convert_charrefs = True
            self.fed = []  # text chunks collected so far

        def handle_data(self, d):
            # Called for each run of text between tags
            self.fed.append(d)

        def get_data(self):
            return ''.join(self.fed)

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

Note: this works only for 3.1. For 3.2 or above, you need to call the parent class's init function. See Using HTMLParser in Python 3.2
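
As a rough illustration of the note above, here is one way the Python 3 version might look when the parent class's __init__ is called. This is only a sketch, assuming Python 3.4 or newer (where HTMLParser accepts the convert_charrefs keyword); the sample strings are the ones from the question, and the extra s.close() call is an addition, not part of the original answer.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text found between HTML tags."""

    def __init__(self):
        # Let HTMLParser initialise its own state instead of calling reset() by hand.
        # Assumption: Python 3.4+, where the convert_charrefs keyword exists.
        super().__init__(convert_charrefs=True)
        self.fed = []  # text chunks collected so far

    def handle_data(self, d):
        # Called for each run of text between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering (added for safety)
    return s.get_data()


# The two examples from the question
for line in ['<a href="whatever.com">some text</a>', '<b>hello</b>']:
    print(strip_tags(line))
# prints:
# some text
# hello
```

With convert_charrefs enabled, character references such as &amp; are decoded before they reach handle_data, so they show up in the result as ordinary characters.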
