Parsing variable data out of a JavaScript tag using Python


Question

I am scraping some websites using BeautifulSoup and Requests. One page I am examining keeps its data inside a <script language="JavaScript" type="text/javascript"> tag. It looks like this:

<script language="JavaScript" type="text/javascript">
var page_data = {
   "default_sku" : "SKU12345",
   "get_together" : {
      "imageLargeURL" : "http://null.null/pictures/large.jpg",
      "URL" : "http://null.null/index.tmpl",
      "name" : "Paints",
      "description" : "Here is a description and it works pretty well",
      "canFavorite" : 1,
      "id" : 1234,
      "type" : 2,
      "category" : "faded",
      "imageThumbnailURL" : "http://null.null/small9.jpg"
       ......

Is there a way to create a Python dictionary or JSON object from the page_data variable inside this script tag? That would be much nicer than trying to pull the values out with BeautifulSoup.

Recommended answer

If you use BeautifulSoup to get the contents of the <script> tag, the json module can do the rest with a bit of string magic:

 import json
 # Keep the text between the first '{' and the last '}', then restore the braces.
 jsonValue = '{%s}' % (textValue.split('{', 1)[1].rsplit('}', 1)[0],)
 value = json.loads(jsonValue)

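The .split() and .rsplit() calls above cut the text at the first { and at the last } in the JavaScript block, which should bracket your object definition. Adding the braces back gives a string that json.loads() can turn into a Python structure. This works only when the object literal is itself valid JSON (double-quoted keys and strings, no trailing commas or comments) and when no other braces follow the object in the same script.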
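If the script holds more than one object, or has code after the object, cutting at the first { and the last } can capture too much. The sketch below is a variant, not the answer's own method. It finds the assignment by variable name and lets json.JSONDecoder.raw_decode() read exactly one JSON value from that point. The helper name extract_js_object and the sample script are made up for illustration, and the literal is still assumed to be valid JSON.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Return the object literal assigned to `var_name` as a Python dict.

    Hypothetical helper, not part of the original answer. It assumes the
    literal is valid JSON (double-quoted keys and strings, no comments).
    """
    # Find "var <name> =" and stop right before the opening brace.
    pattern = r'\bvar\s+%s\s*=\s*(?=\{)' % re.escape(var_name)
    match = re.search(pattern, script_text)
    if match is None:
        raise ValueError('no object assigned to %r found' % var_name)
    # raw_decode() parses one JSON value starting at the given index and
    # ignores whatever follows it, such as ";" or further statements.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Made-up script text with extra braces before and after the target object.
script_text = '''
var other = {"ignored": true};
var page_data = {"default_sku": "SKU12345", "get_together": {"id": 1234}};
function later() { return {}; }
'''
page_data = extract_js_object(script_text, 'page_data')
print(page_data['get_together']['id'])
```

The plain split/rsplit version would fail on this sample, because the text between the first { and the last } spans three separate statements.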

Demo:

>>> import json
>>> textValue = '''
... var page_data = {
...    "default_sku" : "SKU12345",
...    "get_together" : {
...       "imageLargeURL" : "http://null.null/pictures/large.jpg",
...       "URL" : "http://null.null/index.tmpl",
...       "name" : "Paints",
...       "description" : "Here is a description and it works pretty well",
...       "canFavorite" : 1,
...       "id" : 1234,
...       "type" : 2,
...       "category" : "faded",
...       "imageThumbnailURL" : "http://null.null/small9.jpg"
...    }
... };
... '''
>>> jsonValue = '{%s}' % (textValue.split('{', 1)[1].rsplit('}', 1)[0],)
>>> value = json.loads(jsonValue)
>>> value
{u'default_sku': u'SKU12345', u'get_together': {u'category': u'faded', u'canFavorite': 1, u'name': u'Paints', u'URL': u'http://null.null/index.tmpl', u'imageThumbnailURL': u'http://null.null/small9.jpg', u'imageLargeURL': u'http://null.null/pictures/large.jpg', u'type': 2, u'id': 1234, u'description': u'Here is a description and it works pretty well'}}
>>> import pprint
>>> pprint.pprint(value)
{u'default_sku': u'SKU12345',
 u'get_together': {u'URL': u'http://null.null/index.tmpl',
                   u'canFavorite': 1,
                   u'category': u'faded',
                   u'description': u'Here is a description and it works pretty well',
                   u'id': 1234,
                   u'imageLargeURL': u'http://null.null/pictures/large.jpg',
                   u'imageThumbnailURL': u'http://null.null/small9.jpg',
                   u'name': u'Paints',
                   u'type': 2}}
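For completeness, here is a rough end-to-end sketch that goes from an HTML string to the parsed dictionary. It assumes Python 3 and uses the standard-library html.parser in place of BeautifulSoup so that it runs without third-party packages. With BeautifulSoup, the script text would come from the matching tag instead. The ScriptCollector class and the sample page are made up for illustration.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text inside every <script> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


# Made-up page standing in for the downloaded HTML.
html = '''<html><head>
<script language="JavaScript" type="text/javascript">
var page_data = {
   "default_sku" : "SKU12345",
   "get_together" : {"name" : "Paints", "id" : 1234}
};
</script>
</head><body></body></html>'''

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Pick the script that mentions page_data, then apply the split/rsplit step.
text_value = next(s for s in collector.scripts if 'page_data' in s)
json_value = '{%s}' % (text_value.split('{', 1)[1].rsplit('}', 1)[0],)
value = json.loads(json_value)
print(value['get_together']['name'])
```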
