BeautifulSoup get_text does not strip all tags and JavaScript

Question

I am trying to use BeautifulSoup to get text from web pages.

Below is a script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second is the output file.

import sys
from bs4 import BeautifulSoup

def stripTags(s):
    # Parse the markup and return the text of all its nodes.
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    # Read the input file and write the extracted text as UTF-8.
    with open(inFile) as src, open(outFile, 'w', encoding='utf-8') as dst:
        dst.write(stripTags(src.read()))

def main(argv):
    # Expect exactly two arguments after the script name.
    if len(argv) != 3:
        print('Usage:\t\t', argv[0], 'input.html output.txt')
        return 1
    stripTagsFromFile(argv[1], argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))

Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):

html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    Education Manager  Job In London With  Caleeda | Great Jobs In Teaching

var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15255540-21']);
_gaq.push(['_trackPageview']);
_gaq.push(['_trackPageLoadTime']);

Is there anything wrong with my script? I tried passing 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it doesn't help. Is there an alternative to BeautifulSoup that would work better for this task? All I want is to extract the text that would be rendered in a browser for this web page.

Any help will be much appreciated.

Recommended answer

nltk's clean_html() is quite good at this!

Assuming that you already have your HTML stored in a variable html, like

html = urllib.urlopen(address).read()

just use

import nltk
clean_text = nltk.clean_html(html)

Update

Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.

An example of how to achieve this is on this page:

BeatifulSoup4 get_text still has JavaScript: http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript
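
The linked page is not reproduced here. As a rough sketch of one common approach (an assumption, not necessarily the exact code from that answer): remove the script and style elements, plus comments and the doctype, from the parsed tree before calling get_text(). The function name visible_text and the choice of the "html.parser" backend are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype


def visible_text(html):
    """Return an approximation of the text a browser would render.

    Hypothetical helper: the name and parser choice are assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements whose contents are never shown as page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Drop comments and the doctype declaration, which some bs4
    # versions include in get_text() output. Copy the descendants
    # into a list first because the tree is modified while looping.
    for node in list(soup.descendants):
        if isinstance(node, (Comment, Doctype)):
            node.extract()

    # Keep one line per non-empty chunk of text.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)
```

In the script from the question, stripTags could return visible_text(s) instead of BeautifulSoup(s).get_text(). This is still only an approximation of what a browser renders: text hidden by CSS or inserted by JavaScript at runtime is not handled.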
