BeautifulSoup Grab Visible Webpage Text

Problem description

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get the body text (the article) and maybe a few tab names here and there. I have tried the suggestion in this SO question (http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents), but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments I need to pass to findAll() (http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-limit) in order to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?

Recommended answer

Try this:

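import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Comment

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)  # all text nodes, including script bodies and comments

def visible(element):
    # Text inside these tags is never rendered as page content.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # In bs4, str() of a comment omits the <!-- --> markers, so test the node type.
    elif isinstance(element, Comment):
        return False
    return True

visible_texts = list(filter(visible, texts))  # filter() is lazy in Python 3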
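
To try the filtering idea without fetching a live page, here is a minimal sketch that applies the same rule to an inline HTML string. The sample markup and the names SAMPLE_HTML, HIDDEN_PARENTS, is_visible and visible_text are made up for illustration and are not part of the original answer. It assumes Beautiful Soup 4 is installed, and it detects comments with isinstance(element, Comment), where Comment is the class bs4 uses for HTML comment nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Storm report</title><style>p { color: red; }</style></head>
  <body>
    <!-- tracking comment -->
    <script>var x = 1;</script>
    <h1>Storm hits the coast</h1>
    <p>Snow fell overnight.</p>
  </body>
</html>
"""

# Parent tags whose text is not shown as page content.
HIDDEN_PARENTS = {'style', 'script', '[document]', 'head', 'title'}


def is_visible(element):
    """Return True for text nodes a reader would see on the page."""
    if isinstance(element, Comment):
        return False
    return element.parent.name not in HIDDEN_PARENTS


def visible_text(html):
    """Collect the visible text nodes and join them with single spaces."""
    soup = BeautifulSoup(html, 'html.parser')
    strings = soup.find_all(string=True)
    parts = (s.strip() for s in strings if is_visible(s))
    # Drop whitespace-only nodes left over from the HTML indentation.
    return ' '.join(p for p in parts if p)


print(visible_text(SAMPLE_HTML))
```

Stripping each node and joining with spaces is one possible design choice; it gives a single readable line, but it will also put a space between fragments that sit in adjacent inline tags. Text hidden by CSS (for example display: none) is not detected by this approach, since it only looks at tag names.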
