Only get scripts out of HTML file
Problem description
I have a large HTML file that contains the full code of a website. I only care about the code inside the <script>...</script> tags. Is there an easy way to pull just those lines out of the HTML file, or will I have to split the file at each <script>? I want to ignore everything before the first <script> (such as the head), the tags at the end of the file, and the tags in the middle, such as where the document switches from <head> to <body>.
If you want to remove all script tags:
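from bs4 import BeautifulSoup

# Sample markup: two <script> elements mixed in with ordinary list items.
pagehtml = '''
<li> Text 1 </li>
<script>
<li> Text 2 </li>
<li> Text 3 </li>
</script>
<li> Text 4 </li>
<script>
<li> Text 5 </li>
</script>
'''
soup = BeautifulSoup(pagehtml, 'html.parser')
# extract() detaches each <script> element (and its contents) from the tree.
for s in soup.find_all('script'):
    s.extract()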
>>> soup
<li> Text 1 </li>
<li> Text 4 </li>
>>>
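
The snippet above strips the scripts and keeps the rest of the page. For the opposite case described in the question, keeping only the script bodies, here is a minimal sketch. It swaps in the standard library's html.parser.HTMLParser instead of BeautifulSoup so that it runs without extra installs; the names ScriptCollector, extract_scripts and sample_html are made up for illustration and do not come from the original answer.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the inline body of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._buffer = None  # None while the parser is outside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._buffer = []  # start collecting text for this script

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate rather than assign.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._buffer is not None:
            code = ''.join(self._buffer).strip()
            if code:  # skip empty bodies, e.g. <script src="..."></script>
                self.scripts.append(code)
            self._buffer = None


def extract_scripts(html):
    """Return the inline script bodies found in html, in document order."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


# Assumed sample input: head, body, one external script and two inline scripts.
sample_html = '''
<html><head><title>Demo</title>
<script>var a = 1;</script>
</head>
<body>
<p>Some text</p>
<script src="app.js"></script>
<script>console.log(a);</script>
</body></html>
'''

for code in extract_scripts(sample_html):
    print(code)
```

Everything outside the script elements (the head, the body markup, the closing tags) is simply never collected, so there is no need to split the file by hand. With BeautifulSoup the same idea would presumably come down to reading tag.string for each tag returned by soup.find_all('script') rather than calling extract() on it.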