Python, remove all HTML tags from string


Question

I am trying to access the article content from a website using BeautifulSoup, with the code below:

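from urllib.request import urlopen
from bs4 import BeautifulSoup

site = 'http://www.example.com'
page = urlopen(site)
soup = BeautifulSoup(page, 'html.parser')
content = soup.find_all('p')  # list of <p> Tag objects
content = str(content)        # converting to str keeps all of the markup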

The content object contains all of the main text from the page that is within 'p' tags, but other tags are still present in the output. I would like to remove the tags themselves, along with everything enclosed in each matching < > pair, so that only the text remains.

I have tried the following, but it does not seem to work:

' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))
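A small check suggests why this filter leaves markup behind. It uses a made-up HTML string (an assumption, not the real page). Because split() breaks on whitespace only, most tokens are a tag glued to text, such as '<p>Hello', and so fail one of the two tests. A token that is wrapped entirely in tags is dropped together with its text.

```python
# Hypothetical sample standing in for str(content); not taken from the original page.
content = '[<p>Hello <b>big</b> world</p>, <p>Second <a href="#">link</a> here</p>]'

# Same filter as above, split into steps so the tokens can be inspected.
tokens = content.split()
kept = [t for t in tokens if not (t.startswith('<') and t.endswith('>'))]
result = ' '.join(kept)

print(tokens)   # whitespace-separated pieces, tags still attached to words
print(result)   # markup survives; the token '<b>big</b>' is removed entirely
```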

What is the best way to remove substrings from a string that begin and end with a certain pattern, such as < >?

Answer

You could use get_text():

# content must be the list returned by soup.find_all('p'), not str(content)
for i in content:
    print(i.get_text())
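As a minimal end-to-end sketch of the same idea, the snippet below parses an inline HTML string rather than a downloaded page. The markup and the variable names (html, paragraphs, texts, article) are assumptions made for illustration. It keeps the Tag objects returned by find_all() and calls get_text() on each one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = (
    "<html><body>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second <a href='#'>link</a>.</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")  # keep the Tag objects; do not call str() on them

# get_text() returns only the text inside each tag, with nested tags dropped.
texts = [p.get_text() for p in paragraphs]
article = "\n".join(texts)
print(article)
```

Joining the per-paragraph strings with a newline is only one choice; any separator works once the tags are gone.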

The example below is from the Beautiful Soup docs:

>>> from bs4 import BeautifulSoup
>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> soup.get_text()
'\nI linked to example.com\n'
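The leading and trailing newlines in that result come straight from the markup. If they are unwanted, get_text() also accepts a separator string and a strip flag. The sketch below reuses the docs' markup; the variable names plain and joined are illustrative only.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Default call: text pieces are concatenated exactly as they appear in the markup.
plain = soup.get_text()

# strip=True trims whitespace around each text piece and skips empty ones;
# the first argument is the separator placed between the remaining pieces.
joined = soup.get_text(" ", strip=True)

print(repr(plain))
print(repr(joined))
```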
