Only get scripts out of HTML file


Problem description

I have a large HTML file that contains the full code of a website. I only care about the code inside <script>...</script>. Is there an easy way to pull just those lines out of the HTML file, or will I have to split the file at each <script>? I want to ignore everything before the first <script> (such as the start of the head), the tags at the end of the file, and the tags in between, such as where the file switches from <head> to <body>.

Solution

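If you want to remove all script tags from the document (each call to extract() also returns the removed tag, so the script contents remain available):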

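from bs4 import BeautifulSoup

pagehtml = '''
<li> Text 1 </li>
<script>
<li> Text 2 </li>
<li> Text 3 </li>
</script>
<li> Text 4 </li>
<script>
<li> Text 5 </li>
</script>
'''
soup = BeautifulSoup(pagehtml, 'html.parser')
# extract() detaches each <script> tag from the tree and returns it,
# so `scripts` holds the removed tags while `soup` keeps everything else.
scripts = [s.extract() for s in soup.find_all('script')]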


>>> soup

<li> Text 1 </li>

<li> Text 4 </li>

>>>
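
Since the question asks for the script contents themselves rather than the page with the scripts stripped out, the following is a minimal sketch of one way to collect them. It deliberately swaps BeautifulSoup for the standard library's html.parser module so it runs without extra packages. The names ScriptCollector and get_scripts are hypothetical, and the JavaScript lines in the sample are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: gathers the text inside every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # one string per <script> block, in document order
        self._in_script = False  # True while the parser is inside a <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def get_scripts(html):
    """Return the raw contents of each <script> block found in `html`."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


pagehtml = '''
<li> Text 1 </li>
<script>
var a = 1;
</script>
<li> Text 4 </li>
<script>
var b = 2;
</script>
'''
scripts = [s.strip() for s in get_scripts(pagehtml)]
print(scripts)
```

Everything outside the script elements (the <li> items here, or the <head> and <body> tags in a full page) is simply ignored, so there is no need to split the file by hand. Scripts loaded through a src attribute have no inline body and would show up as empty strings in this sketch.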
