BeautifulSoup Grab Visible Webpage Text

Views: 244 Published: 2016/8/5 18:52:33 Tags: python text beautifulsoup html-content-extraction


Question

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article) and maybe a few tab names here and there. I have tried the suggestion in this SO question (http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents), but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments I need to pass to findAll() (http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-limit) in order to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?

Recommended answer

Try this:

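import urllib.request

from bs4 import BeautifulSoup
from bs4.element import Comment

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# find_all is the current bs4 name for findAll; string=True returns every text node
texts = soup.find_all(string=True)

def visible(element):
    # Text inside these tags is not rendered as page content
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # In bs4, str() of a comment drops the <!-- --> markers, so test the type instead
    if isinstance(element, Comment):
        return False
    return True

# filter() is lazy in Python 3, so materialize the result
visible_texts = list(filter(visible, texts))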
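
The same filter can be tried without a network request by running it on an inline HTML string. This is a minimal sketch, assuming the `bs4` package is installed. The sample markup and the names `SAMPLE_HTML`, `is_visible`, and `visible_text` are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Storm report</title>
    <style>p { color: red; }</style>
  </head>
  <body>
    <!-- tracking comment -->
    <script>var hidden = 1;</script>
    <h1>Headline</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

# Parents whose text is not rendered as page content.
HIDDEN_PARENTS = {'style', 'script', '[document]', 'head', 'title'}


def is_visible(element):
    """Return True for text nodes that pass the parent-tag and comment checks."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True


def visible_text(html):
    """Join the visible, non-blank text nodes of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    nodes = soup.find_all(string=True)
    return ' '.join(t.strip() for t in nodes if is_visible(t) and t.strip())


print(visible_text(SAMPLE_HTML))  # prints: Headline First paragraph.
```

Note that this approach only looks at the parent tag and the node type. It does not evaluate CSS, so text hidden with rules such as `display: none` would still be returned.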
