Web scraping unknown data structure (JSON, nested list, or something else?)


Question

I built a web scraper for this page that hinged on parsing a string as JSON. But they've made some updates to the site and now the scraper has stopped working. I think the issue is that the information I need is no longer structured as JSON.

Here is what I originally had:

# Packages
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
import json
import ast

# The part that still works
address = 'https://campus.datacamp.com/courses/intro-to-python-for-data-science/chapter-1-python-basics?ex=2' 
html = urlopen(address)
soup = BeautifulSoup(html, 'lxml')
string = soup.find_all('script')[2].string
json_text = string.strip('window.PRELOADED_STATE = "')[:-2]

# The part that's now broken
lesson = json.loads(json_text)

#> Traceback (most recent call last):
#> <ipython-input-11-f9b7d249d994> in <module>()
#>       2 # The part that's now broken
#>       3 
#> ----> 4 lesson = json.loads(json_text)
#> ~/anaconda3/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
#>     352             parse_int is None and parse_float is None and
#>     353             parse_constant is None and object_pairs_hook is None and not kw):
#> --> 354         return _default_decoder.decode(s)
#>     355     if cls is None:
#>     356         cls = JSONDecoder
#> ~/anaconda3/lib/python3.6/json/decoder.py in decode(self, s, _w)
#>     337 
#>     338         """
#> --> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
#>     340         end = _w(s, end).end()
#>     341         if end != len(s):
#> ~/anaconda3/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
#>     355             obj, end = self.scan_once(s, idx)
#>     356         except StopIteration as err:
#> --> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
#>     358         return obj, end
#> JSONDecodeError: Expecting value: line 1 column 2 (char 1)

The issue is that the information in json_text is no longer structured as JSON.

need_to_parse = BeautifulSoup(json_text, 'lxml').string #Escape HTML
print(len(need_to_parse))
#> 61453
print(need_to_parse[:50])
#> ["~#iM",["preFetchedData",["^0",["course",["^0",["
print(need_to_parse[-50:])
#> "type","MultipleChoiceExercise","id",14253]]]]]]]]

I thought maybe it was a nested list, so I tried ast.literal_eval(), but no luck!

parsed_list = ast.literal_eval(need_to_parse)
#> Traceback (most recent call last):
#>   File "/Users/nicholascifuentes-goodbody/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
#>     exec(code_obj, self.user_global_ns, self.user_ns)
#>   File "<ipython-input-13-55b60da762d6>", line 2, in <module>
#>     parsed_list = ast.literal_eval(need_to_parse)
#>   File "/Users/nicholascifuentes-goodbody/anaconda3/lib/python3.6/ast.py", line 48, in literal_eval
#>     node_or_string = parse(node_or_string, mode='eval')
#>   File "/Users/nicholascifuentes-goodbody/anaconda3/lib/python3.6/ast.py", line 35, in parse
#>     return compile(source, filename, mode, PyCF_ONLY_AST)
#>   File "<unknown>", line 1
#>     ["~#iM",["preFetchedData"

The full output is in a txt file here.

Does anyone recognize this data structure? What's the best way to parse it?

Session info from the reprexpy package:

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-10-19
#> Packages ------------------------------------------------------------------------
#> beautifulsoup4==4.6.0
#> reprexpy==0.1.1

Recommended answer

The data structure is a JavaScript array (of nested arrays), serialised to a string, with HTML entities escaped.

In your browser console, you can unescape it and call eval on the unescaped string to get the array.
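The unescape step also has a counterpart in Python's standard library, as a lighter alternative to passing the text through BeautifulSoup. A minimal sketch, assuming a short made-up sample string in the same shape as the page payload (not the real data):

```python
import html

# Hypothetical sample mimicking the escaped payload; not the real page content.
escaped = '[&quot;~#iM&quot;,[&quot;preFetchedData&quot;,[&quot;^0&quot;,[]]]]'

# html.unescape converts HTML entities such as &quot; back to plain characters.
unescaped = html.unescape(escaped)
print(unescaped)
```

This only undoes the HTML escaping. Turning the resulting string into a Python object is a separate step.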

For me, ast.literal_eval raises SyntaxError, so the string must contain JavaScript elements that are not valid Python syntax. Even if it didn't, ast.literal_eval could still fail on JavaScript elements that are syntactically valid Python but not legal literal values, for example null or objects with unquoted keys.
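The two failure modes can be seen on small inputs. This is a sketch using made-up snippets (the helper name try_literal_eval is hypothetical), not the actual page data:

```python
import ast

def try_literal_eval(text):
    """Return 'ok' or the name of the exception ast.literal_eval raises."""
    try:
        ast.literal_eval(text)
    except SyntaxError:
        # The text is not even parseable as Python source.
        return "SyntaxError"
    except ValueError:
        # The text parses, but contains something that is not a Python literal.
        return "ValueError"
    return "ok"

print(try_literal_eval('[1, "a", [2, 3]]'))  # plain nested list
print(try_literal_eval('[1, !0]'))           # JavaScript-only syntax
print(try_literal_eval('[1, null]'))         # parses as a name, not a literal
print(try_literal_eval('{a: 1}'))            # unquoted key parses as a name
```

The first input evaluates normally, the second fails at parse time, and the last two parse but are rejected as non-literals.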

To parse it you need to shell out to a JavaScript parser, or find a Python tool that parses JavaScript. The answers to this question list some, but note that it has been closed since 2014, so there may be newer solutions available.
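One way to shell out is sketched below, assuming Node.js is installed and on PATH. The function name js_literal_to_python and the NODE_SCRIPT one-liner are illustrative, not part of any library. Node evaluates the text and re-emits it as strict JSON, which Python can then load. Because this uses eval, it runs whatever code the string contains, so it should only be used on input you trust.

```python
import json
import shutil
import subprocess

# Node one-liner: read JavaScript source from stdin, evaluate it as an
# expression, and write the result to stdout as strict JSON.
NODE_SCRIPT = (
    "const src = require('fs').readFileSync(0, 'utf8');"
    "process.stdout.write(JSON.stringify(eval('(' + src + ')')));"
)

def js_literal_to_python(js_source, node="node"):
    """Evaluate a JavaScript literal with Node.js and return a Python object."""
    node_path = shutil.which(node)
    if node_path is None:
        raise RuntimeError("Node.js executable not found on PATH")
    result = subprocess.run(
        [node_path, "-e", NODE_SCRIPT],
        input=js_source,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # pass and receive str instead of bytes
        check=True,               # raise CalledProcessError if node fails
    )
    return json.loads(result.stdout)

# Only attempt the call when Node.js is actually available.
if shutil.which("node"):
    print(js_literal_to_python("[1, null, {a: !0}]"))
```

In the scraper, the unescaped string (need_to_parse in the question) would take the place of the sample literal.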
