BeautifulSoup4 get_text still has javascript
Question
I'm trying to remove all the html/javascript using bs4; however, it doesn't get rid of the javascript. I still see it there mixed in with the text. How can I get around this?
I tried using nltk, which works fine; however, clean_html and clean_url will be removed moving forward. Is there a way to use soup's get_text and get the same result?
I tried looking at this other page:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently I'm using nltk's deprecated functions.
EDIT
Here's an example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
I still see the following for CNN:
$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});
/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});
How can I remove the js?
The only other options I found are:
https://github.com/aaronsw/html2text
The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is one thing nltk was always very good with.
Recommended answer
Based partly on Can I remove script tags with BeautifulSoup?
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
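If you want to try the same idea without hitting the network, here is a minimal offline sketch. The SAMPLE_HTML string and the visible_text helper are made up for illustration and are not part of bs4; the sketch also assumes the built-in "html.parser" backend and uses get_text's separator and strip arguments as one possible alternative to the manual line cleanup above. Newer bs4 releases may already leave script and style contents out of get_text() with some parsers, but removing the elements explicitly keeps the result independent of that.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration only
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>body { color: red; }</style>
    <script>var hidden = "should not appear";</script>
  </head>
  <body>
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <script>console.log("inline script");</script>
  </body>
</html>
"""


def visible_text(html):
    """Return the document text with script/style contents removed.

    visible_text is a hypothetical helper name, not a bs4 function.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove every <script> and <style> element together with its contents
    for element in soup(["script", "style"]):
        element.decompose()
    # Join the remaining strings with newlines; strip=True trims each
    # string and skips the ones that are only whitespace
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(visible_text(SAMPLE_HTML))
```

Calling decompose() before get_text() matters here: get_text() only collects strings from what is still in the tree, so anything removed beforehand cannot leak into the output.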