BeautifulSoup4 get_text still has JavaScript

Question

I'm trying to remove all the HTML and JavaScript from a page using bs4, but it doesn't get rid of the JavaScript. I still see it mixed in with the text. How can I get around this?

I tried using nltk, which works fine; however, clean_html and clean_url will be removed moving forward. Is there a way to use soup's get_text and get the same result?

I tried looking at this other page:

BeautifulSoup get_text does not strip all tags and JavaScript
http://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript

Currently I'm using nltk's deprecated functions.

Edit

Here is an example:

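from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())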

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it is really slow at times and creates noticeable lag, whereas speed is one thing nltk was always very good at.

Recommended answer

Based in part on "Can I remove script tags with BeautifulSoup?":

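from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")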

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
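
To try the same approach without a network request, here is a minimal sketch that wraps the steps in a function and runs them on an inline HTML string. The function name `visible_text` and the sample markup are made up for illustration and are not part of the original answer; the built-in "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document without script/style contents.

    Hypothetical helper: it only repackages the steps shown in the answer.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove every <script> and <style> element, including its contents.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Strip each line, split on double spaces, and drop empty pieces.
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample page with inline JavaScript and CSS.
sample = """<html><head><title>Demo</title>
<style>body { color: red; }</style>
<script>var pushLib = window.safaripushLib;</script>
</head><body><h1>Headline</h1>
<p>First paragraph.</p>
<script>MainLocalObj.init();</script>
</body></html>"""

print(visible_text(sample))
```

On this sample, only the title, the headline and the paragraph should remain; the contents of the script and style elements are dropped before get_text is called.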
