How to strip entire HTML, CSS and JS code or tags from an HTML page in Python
Question
Possible Duplicate:
BeautifulSoup Grab Visible Webpage Text
Web scraping with Python
Say I have a very complex HTML page consisting of the usual HTML tags, with CSS and JS in the middle. We might see all the worst cases.
All I want is to strip all of the above tags and code and return the "text".
Put simply:
<html><body>Text</body></html>
This might contain JS, CSS, etc.
I am trying to use BeautifulSoup, but it is not removing the JS from the code. Now I am thinking of using a regex, but I am not sure how to do it.
Edit 1
Here is my attempt on a simple Bootstrap HTML page:
from bs4 import BeautifulSoup as bs
import requests
# MY_URL is a placeholder for the URL of the page being fetched
bs(requests.get(MY_URL).text, "html.parser").get_text()
This returns the following text:
html
Home
Le styles
body {
padding-top: 10%;
padding-left: 30%;
}
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
<script src="http://htm...html5.js"></script>
<![endif]
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];
_gaq.push(['_trackPageview']);
(function() {
var ga = do...............
})();
Recommended answer
Django uses this function to strip tags from text:
import re

# force_unicode is a helper from older Django releases (django.utils.encoding)
def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    return re.sub(r'<[^>]*?>', '', force_unicode(value))
(You won't need the force_unicode part.)
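Note that the regex above only removes the tags themselves; the text between `<script>` or `<style>` tags is left in the result, which is the problem described in the question. A minimal sketch of one way around that, assuming plain `str` input and regex-friendly markup, is to drop those blocks (and HTML comments) before stripping the remaining tags. The names `strip_markup`, `_BLOCKS`, `_COMMENTS` and `_TAGS` are hypothetical and not part of Django:

```python
import re

# Hypothetical helper patterns (not part of Django): match whole <script>/<style>
# elements and HTML comments, including their contents.
_BLOCKS = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>', re.IGNORECASE | re.DOTALL)
_COMMENTS = re.compile(r'<!--.*?-->', re.DOTALL)
_TAGS = re.compile(r'<[^>]*?>')


def strip_tags(value):
    """The regex from the answer, without force_unicode."""
    return _TAGS.sub('', value)


def strip_markup(html):
    """Remove script/style blocks and comments first, then the remaining tags."""
    html = _BLOCKS.sub('', html)
    html = _COMMENTS.sub('', html)
    text = strip_tags(html)
    # Collapse the runs of whitespace left behind by the removed markup.
    return ' '.join(text.split())


sample = (
    '<html><head><style>body { padding-top: 10%; }</style></head>'
    '<body><p>Text</p><script>var _gaq = _gaq || [];</script></body></html>'
)

print(strip_tags(sample))    # tags gone, but the CSS and JS bodies remain
print(strip_markup(sample))  # only the visible text
```

Regular expressions are brittle on real-world HTML (for example, a `<` inside inline JavaScript or an attribute value containing `>` can confuse the tag pattern), so this should be read as a sketch for simple pages rather than a general solution.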
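As an alternative to regular expressions, a parser-based approach can skip script and style content explicitly. The sketch below uses only the standard library's `html.parser.HTMLParser`; the class `VisibleTextParser` and the function `visible_text` are hypothetical names introduced for illustration, and the sample page is made up to resemble the output shown in the question:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text that is not inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside every skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse whitespace. This can insert a
        # space where inline tags split a word; acceptable for a sketch.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>body { padding-top: 10%; }</style></head>"
    "<body><!-- /container --><p>Home | Under Construction</p>"
    "<script>var _gaq = _gaq || []; if (1 < 2) { _gaq.push(['_trackPageview']); }</script>"
    "</body></html>"
)

print(visible_text(page))
```

HTML comments are dropped because `HTMLParser` routes them to `handle_comment`, which this class leaves at its default of doing nothing. The same idea should carry over to BeautifulSoup by removing the script and style elements from the tree before calling `get_text()`.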