How to extract a JSON object that was defined in an HTML page's JavaScript block using Python?


Question


I am downloading HTML pages that have data defined in them in the following way:

...
<script type="text/javascript">
    window.blog.data = {"activity":{"type":"read"}};
</script>
...


I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing)

Thanks

Edit:

Would it be possible and more correct to do this with a Python headless browser (e.g., Ghost.py)?

Answer


BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
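To illustrate the difference, a quick check with the standard json module: the unquoted key and single-quoted string below are legal JavaScript but rejected by JSON, while the literal from the question parses fine.

```python
import json

# Legal JavaScript object literal, but NOT valid JSON:
# the key is unquoted and the string uses single quotes.
js_literal = "{activity: {type: 'read'}}"
try:
    json.loads(js_literal)
except json.JSONDecodeError as e:
    print("not valid JSON:", e)

# The literal from the question, however, is also valid JSON:
data = json.loads('{"activity":{"type":"read"}}')
print(data["activity"]["type"])  # read
```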


In simple cases you could:



  1. extract the <script> element's text using an HTML parser
  2. assume that window.blog... is on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
  3. assume that the string is valid JSON and parse it using the json module
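A minimal sketch of steps 2-3 using only the standard library (skipping the HTML parser, and assuming the assignment sits on one line with no ';' inside the object):

```python
import json

script_text = '''
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
'''

# step 2: naive string manipulation based on the assumptions above
literal = script_text.split('window.blog.data = ', 1)[1].split(';', 1)[0]

# step 3: assume the extracted literal is valid JSON
data = json.loads(literal)
print(data)  # {'activity': {'type': 'read'}}
```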

Example:

#!/usr/bin/env python
import json
import re

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""

soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'


If the assumptions are incorrect then the code fails.
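For instance, a contrived page where the object itself contains '};' inside a string value makes the non-greedy regex stop too early (a simplified version of the pattern above):

```python
import json
import re

# contrived input: the '};' inside the "msg" string breaks assumption 2
tricky = 'window.blog.data = {"msg": "};", "x": 1};'

# the non-greedy {.*?} stops at the first '};', which here is inside a string
m = re.search(r'window\.blog\.data\s*=\s*({.*?})\s*;', tricky, re.DOTALL)
print(m.group(1))  # {"msg": "}

try:
    json.loads(m.group(1))
except json.JSONDecodeError:
    print('extracted text is not valid JSON')
```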


To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):

import json

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
from slimit import ast  # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor

# html is the same page string as in the previous example
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'


There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like the other AST nodes. That would allow supporting arbitrary JavaScript code (as long as slimit can parse it).
