How do I extract a jsonObj out of JavaScript with Scrapy?


Question

I want to build a dictionary from the jsonObj. Here's what I have so far. I haven't yet figured out how to extract the JSON so that I can parse it.

    def parse_store(self, response):
        jsonobj = response.xpath('//script[@window.appData//text').extract()
        stores = json.loads(jsonobj.body_as_unicode())
        print(stores)
        for stores in response:
            stores = {}
            stores['stores'] = response['stores']
            stores['stores']['id'] = response['stores']['id']
            stores['stores']['name'] = response['stores']['name']
            stores['stores']['addr1'] = response['stores']['addr1']
            stores['stores']['city'] = response['stores']['city']
            stores['stores']['state'] = response['stores']['state']
            stores['stores']['country'] = response['stores']['country']
            stores['stores']['zipCode'] = response['stores']['zipCode']
            stores['stores']['phone'] = response['stores']['phone']
            stores['stores']['latitude'] = response['stores']['latitude']
            stores['stores']['longitude'] = response['stores']['longitude']
            stores['stores']['services'] = response['stores']['services']
        print(stores)

        return stores

Recommended answer

One way to do this is to use js2xml (disclaimer: I wrote js2xml).

So let's assume you have a Scrapy Selector holding a <script> element with some JavaScript data:

>>> import scrapy
>>> html = '''<script>
... window.appData = {
...     "stores": [
...     {   "id": "952",
...         "name": "BAYTOWN TX",
...         "addr1": "4620 garth rd",
...         "city": "baytown",
...         "state": "TX",
...         "country": "US",
...         "zipCode": "77521",
...         "phone": "281-420-0079",
...         "locationType": "Store",
...         "locationSubType": "Big Box Store",
...         "latitude": "29.77313",
...         "longitude": "-94.97634"
...     }]
... }
... </script>'''
>>> selector = scrapy.Selector(text=html, type="html")

Let's extract the JavaScript text from it:

>>> js = selector.xpath('//script/text()').extract_first()
>>> js
u'\nwindow.appData = {\n    "stores": [\n    {   "id": "952",\n        "name": "BAYTOWN TX",\n        "addr1": "4620 garth rd",\n        "city": "baytown",\n        "state": "TX",\n        "country": "US",\n        "zipCode": "77521",\n        "phone": "281-420-0079",\n        "locationType": "Store",\n        "locationSubType": "Big Box Store",\n        "latitude": "29.77313",\n        "longitude": "-94.97634"\n    }]\n}\n'

Now, import js2xml and call js2xml.parse(). You get an lxml tree back that represents the JavaScript code (roughly its AST):

>>> import js2xml
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7fc7f1ba3bd8>

If you're curious, here's what the tree looks like:

>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="appData"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores">
          <array>
            <object>
              <property name="id">
                <string>952</string>
              </property>
              <property name="name">
                <string>BAYTOWN TX</string>
              </property>
              <property name="addr1">
                <string>4620 garth rd</string>
              </property>
              <property name="city">
                <string>baytown</string>
              </property>
              <property name="state">
                <string>TX</string>
              </property>
              <property name="country">
                <string>US</string>
              </property>
              <property name="zipCode">
                <string>77521</string>
              </property>
              <property name="phone">
                <string>281-420-0079</string>
              </property>
              <property name="locationType">
                <string>Store</string>
              </property>
              <property name="locationSubType">
                <string>Big Box Store</string>
              </property>
              <property name="latitude">
                <string>29.77313</string>
              </property>
              <property name="longitude">
                <string>-94.97634</string>
              </property>
            </object>
          </array>
        </property>
      </object>
    </right>
  </assign>
</program>

Next, you want the right-hand side of the window.appData assignment, which is a JavaScript object. You can select it with a regular XPath call:

>>> jstree.xpath('''
...     //assign[left//identifier[@name="appData"]]
...         /right
...             /*
...     ''')
[<Element object at 0x7fc7f257f5f0>]
>>> 

(i.e. you select the <assign> node, filter on its <left> part, and take the child of its <right> part, which is an <object>)
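
To make the filtering logic of the XPath expression above more concrete, here is a rough sketch that spells out the same three steps by hand using only the standard library's xml.etree.ElementTree. The XML is a trimmed copy of the tree printed earlier, plus a second, unrelated assignment that is made up for illustration; the helper name assigned_values is hypothetical and not part of js2xml.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the js2xml tree shown earlier. The first <assign> is a
# made-up extra assignment, added only to show why filtering on <left> matters.
XML = """
<program>
  <assign operator="=">
    <left><identifier name="other"/></left>
    <right><string>ignored</string></right>
  </assign>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object><identifier name="window"/></object>
        <property><identifier name="appData"/></property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores"><array/></property>
      </object>
    </right>
  </assign>
</program>
"""


def assigned_values(tree, name):
    """Return the children of <right> for each <assign> whose <left> mentions `name`."""
    found = []
    for assign in tree.iter("assign"):
        left = assign.find("left")
        if left is None:
            continue
        # Same idea as the XPath predicate left//identifier[@name="..."]
        if any(ident.get("name") == name for ident in left.iter("identifier")):
            right = assign.find("right")
            if right is not None:
                found.extend(list(right))  # same idea as /right/*
    return found


tree = ET.fromstring(XML)
values = assigned_values(tree, "appData")
print([v.tag for v in values])  # ['object']
```

With lxml, as returned by js2xml, the single XPath expression does all of this in one call; the loop is only meant to show what that expression selects and what it skips.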

js2xml has helpers to convert <object> nodes into Python dicts and lists (we select the first result of the xpath() call with [0]):

>>> from pprint import pprint
>>> obj = jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]
>>> pprint(js2xml.jsonlike.make_dict(obj))
{'stores': [{'addr1': '4620 garth rd',
             'city': 'baytown',
             'country': 'US',
             'id': '952',
             'latitude': '29.77313',
             'locationSubType': 'Big Box Store',
             'locationType': 'Store',
             'longitude': '-94.97634',
             'name': 'BAYTOWN TX',
             'phone': '281-420-0079',
             'state': 'TX',
             'zipCode': '77521'}]}
>>> 
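
When the value assigned to window.appData happens to be strict JSON, as it is in this example, a standard-library-only alternative may be enough. The sketch below is a different technique from the js2xml approach: it locates the assignment with a regular expression and lets json.JSONDecoder.raw_decode() parse one JSON value from that position. The helper name extract_assigned_json and the shortened sample data are assumptions made for illustration.

```python
import json
import re


def extract_assigned_json(js_text, target="window.appData"):
    """Return the JSON value assigned to `target` in js_text, or None if absent.

    Hypothetical helper, not part of Scrapy or js2xml.
    """
    match = re.search(re.escape(target) + r"\s*=\s*", js_text)
    if match is None:
        return None
    # raw_decode() parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (for example a trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return value


# Shortened version of the <script> text from the example above.
js = '''
window.appData = {
    "stores": [
    {   "id": "952",
        "name": "BAYTOWN TX",
        "city": "baytown",
        "state": "TX"
    }]
};
'''

data = extract_assigned_json(js)
for store in data["stores"]:
    print(store["id"], store["name"], store["city"], store["state"])
```

In a Scrapy callback, the script text could presumably be obtained with something like response.xpath('//script[contains(., "window.appData")]/text()').get() and passed to this helper. Note that raw_decode() raises json.JSONDecodeError if the object literal is not valid JSON (unquoted keys, single-quoted strings, trailing commas); in that case the js2xml route shown above is the more tolerant option.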
