BeautifulSoup get_text does not strip all tags and JavaScript


Problem description

I am trying to use BeautifulSoup to get text from web pages.

Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second is the output file.

import sys
from bs4 import BeautifulSoup

def stripTags(s):
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    # Read raw bytes so BeautifulSoup can detect the input encoding itself,
    # and write the extracted text back out as UTF-8.
    with open(inFile, "rb") as src, open(outFile, "w", encoding="utf-8") as dst:
        dst.write(stripTags(src.read()))

def main(argv):
    if len(argv) != 3:
        print("Usage:", argv[0], "input.html output.txt")
        return 1
    stripTagsFromFile(argv[1], argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))

Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):

html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    Education Manager  Job In London With  Caleeda | Great Jobs In Teaching

var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15255540-21']);
_gaq.push(['_trackPageview']);
_gaq.push(['_trackPageLoadTime']);

Is there anything wrong with my script? I tried passing 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it didn't help. Is there an alternative to BeautifulSoup that would work better for this task? All I want is to extract the text that would be rendered in a browser for this web page.

Any help would be much appreciated.

Recommended answer

nltk's clean_html() is quite good at this!

Assuming that you already have your HTML stored in a variable html, like

import urllib.request
html = urllib.request.urlopen(address).read()

then just use

import nltk
clean_text = nltk.clean_html(html)

Update

Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.

An example of how to achieve this is on this page:

BeautifulSoup4 get_text still has javascript
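
For reference, one common way to do this with BeautifulSoup alone is to remove the script and style elements (and the comment and doctype nodes) before calling get_text(). The following is only a minimal sketch under a few assumptions: the helper name visible_text is made up for illustration, the built-in "html.parser" is used as the parser, and "text a browser would render" is approximated by dropping script, style, comments and the doctype, which is not a full rendering model.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype


def visible_text(html):
    """Hypothetical helper: text of `html` without script, style, comment or doctype content."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are code or CSS rather than page text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Comments and the doctype are string nodes, not tags, so find them by type.
    for node in soup.find_all(string=lambda s: isinstance(s, (Comment, Doctype))):
        node.extract()

    # Put each text node on its own line, then drop the blank lines left behind.
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)
```

In the script from the question, stripTags could return visible_text(s) instead of BeautifulSoup(s).get_text(). Note that the page title is still included, since it is ordinary text inside the head element.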
