BeautifulSoup get_text does not strip all tags and JavaScript
Question
I am trying to use BeautifulSoup to get text from web pages.
Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second the output file.
import sys
from bs4 import BeautifulSoup

def stripTags(s):
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))

def main(argv):
    if len(sys.argv) <> 3:
        print 'Usage:\t\t', sys.argv[0], 'input.html output.txt'
        return 1
    stripTagsFromFile(sys.argv[1], sys.argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
Education Manager Job In London With Caleeda | Great Jobs In Teaching
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15255540-21']);
_gaq.push(['_trackPageview']);
_gaq.push(['_trackPageLoadTime']);
Is there anything wrong with my script? I was trying to pass 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it doesn't help. Is there an alternative to BeautifulSoup which would work better for this task? All I want is to extract the text which would be rendered in a browser for this web page.
Any help will be much appreciated.
Answer
nltk's clean_html() is quite good at this!
Assuming that you already have your html stored in a variable html, like
html = urllib.urlopen(address).read()
then just use
import nltk
clean_text = nltk.clean_html(html)
Update
Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.
An example on how to achieve this is on this page:
BeatifulSoup4 get_text still has JavaScript (http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript)
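The approach described on that page boils down to deleting the <script> and <style> subtrees before extracting text. A minimal sketch (assuming bs4 is installed; the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Page</title>"
        "<script>var _gaq = _gaq || [];</script>"
        "<style>body { color: red; }</style></head>"
        "<body><p>Visible text</p></body></html>")

soup = BeautifulSoup(html, "html.parser")

# Remove the <script> and <style> subtrees entirely, so their
# contents never reach get_text().
for tag in soup(["script", "style"]):
    tag.decompose()

clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)  # -> Page Visible text
```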