Extract JSON from HTML Script tag with BeautifulSoup in Python


Question

I have the following HTML. How can I extract the JSON assigned to the variable window.__INITIAL_STATE__?

<!DOCTYPE doctype html>

<html lang="en">
<script>
                  window.sessConf = "-2912474957111138742";
                  /* <sl:translate_json> */
                  window.__INITIAL_STATE__ = { /* Target JSON here with 12 million characters */};
                  /* </sl:translate_json> */
                </script>
</html>

Recommended answer

You can use the following Python code to extract the JavaScript code from the script tag and wrap it so that Node.js can evaluate it.

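from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
s = soup.find('script')
# Define `window`, run the page's script, then print the target object as JSON.
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
with open('temp.js', 'w') as f:
    f.write(js)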

The JS code is written to the file temp.js. You can then call node to execute that file.

from subprocess import check_output
window_init_state = check_output(['node','temp.js'])

The Python variable window_init_state contains the JSON string (returned as bytes by check_output) of the JS object window.__INITIAL_STATE__. You can parse it in Python with json.loads or a json.JSONDecoder.

Example

from subprocess import check_output
import json, bs4
html='''<!DOCTYPE doctype html>

<html lang="en">
<script> window.sessConf = "-2912474957111138742";
                  /* <sl:translate_json> */
                  window.__INITIAL_STATE__ = { 'Hello':'World'};
                  /* </sl:translate_json> */
                </script>
</html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser explicitly
with open('temp.js','w') as f:
    f.write('window = {};\n'+
            soup.find('script').text.strip()+
            ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));')
window_init_state = check_output(['node','temp.js'])
print(json.loads(window_init_state))

Output:

{'Hello': 'World'}
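
If Node.js is not available, a Python-only variant may be enough in some cases. The sketch below is not part of the answer above: it assumes the value assigned to window.__INITIAL_STATE__ is strict JSON (double-quoted keys and strings), which is not guaranteed for arbitrary JavaScript. The helper name extract_assigned_json is hypothetical, and script_text stands for the string you would get from soup.find('script').text. It uses json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows it.

```python
import json
import re


def extract_assigned_json(script_text, variable="window.__INITIAL_STATE__"):
    """Parse the JSON value assigned to `variable` inside a script's text.

    Hypothetical helper: assumes the assigned value is strict JSON.
    """
    # Find "<variable> =" and position ourselves at the first character of the value.
    marker = re.compile(re.escape(variable) + r"\s*=\s*")
    match = marker.search(script_text)
    if match is None:
        raise ValueError(f"{variable} is not assigned in this script")
    # raw_decode returns (value, end_index); the trailing ";" and comments are ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Stand-in for soup.find('script').text from the example above, with double quotes.
script_text = """
window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = {"Hello": "World"};
/* </sl:translate_json> */
"""

state = extract_assigned_json(script_text)
print(state)
```

With the single-quoted object from the example ({ 'Hello':'World'}), raw_decode raises json.JSONDecodeError, because single quotes are not valid JSON. In that case, evaluating the script with node as shown above remains the more robust route.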
