BeautifulSoup Grab Visible Webpage Text
Question
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get just the body text (the article), and maybe a few tab names here and there. I have tried the suggestion in this SO question (http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents), but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments I need to pass to findAll() (http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-limit) in order to get only the visible text on a webpage.
So, how should I find all visible text, excluding scripts, comments, CSS, etc.?
Recommended answer
Try this:
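import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Comment

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
texts = soup.findAll(text=True)  # every text node, including scripts and comments

def visible(element):
    # Text inside these tags is never rendered as page content.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # In bs4, str(element) omits the <!-- --> delimiters, so test the node type instead.
    elif isinstance(element, Comment):
        return False
    return True

visible_texts = list(filter(visible, texts))  # filter() is lazy in Python 3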
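The filter above can also be tried without a network request. What follows is only a minimal sketch, assuming Python 3 with the bs4 package installed. SAMPLE_HTML, HIDDEN_PARENTS and visible_text() are hypothetical names made up for this illustration, and the comment check uses bs4's Comment class rather than a regular expression.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used here instead of fetching a real URL.
SAMPLE_HTML = """<html><head><title>Storm</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><!-- hidden note --><h1>Headline</h1>
<p>First paragraph.</p><p>Second paragraph.</p></body></html>"""

# Parents whose text is never shown as page content.
HIDDEN_PARENTS = {'style', 'script', '[document]', 'head', 'title'}

def visible(element):
    """Return True for text nodes a browser would normally render."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True

def visible_text(html):
    """Join the visible, non-blank text nodes of an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all(string=True) is the current bs4 spelling of findAll(text=True).
    texts = soup.find_all(string=True)
    return ' '.join(t.strip() for t in texts if visible(t) and t.strip())

print(visible_text(SAMPLE_HTML))
```

Whitespace-only nodes are dropped before joining, so the result is a single line of readable text. Note that this rule only looks at tag names and node types; text hidden through CSS (for example display: none) would still be returned.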