BeautifulSoup4 get_text still has JavaScript

Question

I'm trying to remove all the HTML/JavaScript using bs4; however, it doesn't get rid of the JavaScript. I still see it mixed in with the text. How can I get around this?

I tried nltk, which works fine; however, clean_html and clean_url will be removed going forward. Is there a way to use soup's get_text and get the same result?

I tried looking at this other page:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

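# Python 3: urlopen lives in urllib.request, and print is a function
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())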

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is something nltk always handled very well.

Recommended answer

Based partly on Can I remove script tags with BeautifulSoup?

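# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")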

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
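
To try the same approach without hitting the network, here is a minimal sketch that wraps the steps above in a function and runs them on a small inline page. The function name visible_text and the SAMPLE markup are made up for illustration, and the built-in "html.parser" backend is an assumption; another parser such as lxml should behave the same way for this input.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document without script/style contents.

    Hypothetical helper: it follows the same steps as the answer above.
    """
    soup = BeautifulSoup(html, "html.parser")

    # remove <script> and <style> elements together with their contents
    for element in soup(["script", "style"]):
        element.decompose()

    # strip each line, split on double spaces, and drop empty pieces
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample page with inline CSS and JavaScript
SAMPLE = """<html><head><title>Demo</title>
<style>body { color: red; }</style>
<script>var tracker = "use strict";</script>
</head><body>
<h1>Headline</h1>
<p>First paragraph.</p>
<script>window.alert("hi");</script>
</body></html>"""

text = visible_text(SAMPLE)
print(text)
```

The printed text contains the title, headline, and paragraph, but none of the CSS or JavaScript. Calling soup(...) is shorthand for soup.find_all(...), and decompose() removes each matched element from the tree before get_text() runs.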
