How to extract a JSON object that was defined in a HTML page javascript block using Python?
Question
I am downloading HTML pages that have data defined in them in the following way:
... <script type="text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)
Thanks
Would it be possible and more correct to do this with a Python headless browser (e.g., Ghost.py)?
Recommended answer
BeautifulSoup is an html parser; you also need a javascript parser here. By the way, some javascript object literals are not valid json (though in your example the literal is also a valid json object).
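To make that distinction concrete, here is a small sketch. The second literal and the helper name is_valid_json are made up for illustration and do not come from any real page. It shows that the json module accepts the literal from the question but rejects an object literal that is legal javascript only:

```python
import json

# Valid JSON and also a valid javascript object literal (from the question).
json_like = '{"activity":{"type":"read"}}'
# Valid javascript but not valid JSON: unquoted key, single quotes, trailing comma.
# (Hypothetical literal, for illustration only.)
js_only = "{activity: {'type': 'read',}}"


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json(json_like))
print(is_valid_json(js_only))
```

When the second case applies, the json module alone is not enough and a javascript parser is needed.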
In simple cases you could:
- extract <script>'s text using an html parser
- assume that window.blog... is a single line or that there is no ';' inside the object, and extract the javascript object literal using simple string manipulations or a regex
- assume that the string is valid json and parse it using the json module
Example:
#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
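# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# find the <script> element whose text mentions window.blog.data
# (string= is the current name of the older text= argument)
script = soup.find('script', string=re.compile(r'window\.blog\.data'))
# capture the object literal between "window.blog.data =" and the closing ';'
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)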
data = json.loads(json_text)
assert data['activity']['type'] == 'read'
If the assumptions are incorrect then the code fails.
To relax the second assumption, a javascript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):
from slimit import ast # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
# take the right-hand side of the first assignment to window.blog.data
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma()) # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'
There is no need to treat the object literal (obj) as a json object. To get the necessary info, obj can be visited recursively like other ast nodes. That would make it possible to support arbitrary javascript code (as long as slimit can parse it).
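As a lighter-weight middle ground that is not part of the original answer, the standard library's json.JSONDecoder.raw_decode can parse a single JSON value starting at a given index and ignore whatever follows it. The sketch below still assumes the literal is valid json, but it no longer depends on where the ';' appears or on the object fitting on one line. The helper name extract_assigned_json and the sample script text are hypothetical:

```python
import json
import re


def extract_assigned_json(script_text, name):
    """Return the JSON value assigned to `name` in script_text, or None."""
    # Locate "<name> =" and remember where the value starts.
    match = re.search(re.escape(name) + r'\s*=\s*', script_text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores the text after it (';', more code, comments).
        value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except ValueError:
        return None  # the literal is not valid JSON
    return value


# Hypothetical script text: a multi-line object with a ';' inside a string.
script_text = '''
// ..
window.blog.data = {"activity": {"type": "read",
                                 "note": "a; b"}};
// ..
'''
data = extract_assigned_json(script_text, 'window.blog.data')
assert data['activity']['type'] == 'read'
```

Here script_text would normally come from script.string in the BeautifulSoup example. Returning None instead of raising is a design choice of this sketch; raising an exception may be preferable when a missing object should be treated as an error.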