Strip HTML from strings in Python
Problem description
from mechanize import Browser

br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print line
When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it should print only 'some text'; for '<b>hello</b>' it should print 'hello', and so on. How would one go about doing this?
Solution
I have always used this function to strip HTML tags, since it requires only the Python standard library:
On Python 2

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
For Python 3

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # The parent initializer must run on Python 3.2 and later;
        # it also calls reset().
        super().__init__()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Note: on Python 3.2 or later, the subclass must call the parent class's __init__ function, as super().__init__() does above. Calling only self.reset() in __init__ worked on 3.1 alone. See "Using HTMLParser in Python 3.2".
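To tie the answer back to the question, here is a minimal sketch of the Python 3 variant applied to the question's example strings. The sample_lines list is a hypothetical stand-in for the lines read from the page, and the s.close() call and the convert_charrefs keyword argument (available on recent Python 3 versions) are additions not present in the original answer.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text found between HTML tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # assumption: flush any text the parser is still buffering
    return s.get_data()


# Hypothetical sample input standing in for br.response().readlines()
sample_lines = [
    '<a href="whatever.com">some text</a>\n',
    '<b>hello</b>\n',
    '<p>fish &amp; chips</p>\n',
]

for line in sample_lines:
    # strip() drops the trailing newline that follows the last tag
    print(strip_tags(line).strip())
```

Each line is parsed with a fresh MLStripper instance, so text collected from one line never leaks into the next.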