Extract all <script> tags in an HTML page and append to the bottom of the document
Question
Could someone tell me how I can extract and remove all the <script> tags in an HTML document and add them to the end of the document, right before the </body></html>? I'd like to try and avoid using lxml, please.
Thanks.
Recommended answer
This answer is simple and may miss many nuances. However, it should give you a general idea of how to go about it and how to improve on it. I am sure it can be improved, but you should be able to do that quickly with the help of the documentation.
Reference documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
# BeautifulSoup 3 import, as used in the reference documentation above
from BeautifulSoup import BeautifulSoup

doc = ['<html><script type="text/javascript">document.write("Hello World!")',
       '</script><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

# findAll returns a list, so the tree can be modified safely inside the loop
for tag in soup.findAll('script'):
    # Use extract to remove the tag from its current position
    tag.extract()
    # Use a simple insert to put it back as the last child of <body>
    soup.body.insert(len(soup.body.contents), tag)

print(soup.prettify())
Output:
<html>
 <head>
  <title>
   Page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">
   This is paragraph
   <b>
    one
   </b>
   .
  </p>
  <p id="secondpara" align="blah">
   This is paragraph
   <b>
    two
   </b>
   .
  </p>
  <script type="text/javascript">
   document.write("Hello World!")
  </script>
 </body>
</html>
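The answer above relies on the BeautifulSoup 3 package. If no third-party parser is available at all, the same idea (pull each <script> element out, then put it back just before the closing body tag) can be sketched with Python 3's standard html.parser module. This is a different technique from the answer's, not a drop-in replacement: the names ScriptMover and move_scripts_to_end are made up for illustration, and the sketch assumes reasonably well-formed markup. It re-emits end tags in lowercase and drops processing instructions.

```python
from html.parser import HTMLParser


class ScriptMover(HTMLParser):
    """Hypothetical helper: separates <script> elements from the rest of the page."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.page = []         # markup outside <script> elements
        self.scripts = []      # markup belonging to <script> elements
        self.in_script = False
        self.insert_at = None  # index in self.page of the first </body> or </html>

    def _emit(self, text):
        (self.scripts if self.in_script else self.page).append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        # get_starttag_text() returns the start tag exactly as it was written
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("body", "html") and self.insert_at is None:
            self.insert_at = len(self.page)
        self._emit("</%s>" % tag)
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def move_scripts_to_end(html):
    """Return html with every <script> element moved just before </body>."""
    mover = ScriptMover()
    mover.feed(html)
    mover.close()
    # Fall back to the very end when neither </body> nor </html> was seen
    at = mover.insert_at if mover.insert_at is not None else len(mover.page)
    return "".join(mover.page[:at] + mover.scripts + mover.page[at:])


sample = ('<html><head><script src="a.js"></script></head>'
          '<body><p>Hi &amp; bye</p><script>var x = 1 < 2;</script>'
          '<p>End</p></body></html>')
print(move_scripts_to_end(sample))
```

Because the parser only copies text through, whitespace and attribute order are preserved, but broken markup is not repaired the way BeautifulSoup repairs it. With the newer bs4 package, the answer's loop would use find_all, extract, and append instead.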