Scrapy:从HTML脚本中提取JSON [英] Scrapy: extract JSON from within HTML script

查看：55 发布时间：2021/2/13 20:09:29 python html json xpath scrapy

本文介绍了Scrapy:从HTML脚本中提取JSON的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试从HTML脚本中提取(似乎是)JSON数据. HTML脚本在网站上看起来像这样:

I'm trying to extract (what appears to be) JSON data from within an HTML script. The HTML script looks like this on the site:

<script>
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
</script>

我想提取以下内容:

[{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]

使用以下代码，我可以获得完整的脚本.

Using the following code, I'm able to get the full script.

    def parse(self, response):
        print response.xpath('/html/body/script[2]').extract()

有没有一种简单的方法可以从该脚本中提取"id"，"name"等值.或者，通过修改xpath是否有更直接的方法?我似乎无法使用Firebug对xpath进行更深入的研究.

Is there a simple way to then extract the values for "id", "name", etc. from that script. Or, is there a more direct way by modifying the xpath? I can't seem to go any deeper on the xpath using firebug.

推荐答案

为此，您可以使用 js2xml

为说明起见，首先，让我们用示例HTML创建一个Scrapy选择器，并获取JavaScript语句:

To illustrate, first, let's create a Scrapy selector with your sample HTML, and grab the JavaScript statements:

>>> import scrapy
>>> sample = '''<script>
...   $(document).ready(function(){
...     var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
...     var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
...   });
... </script>'''
>>> selector = scrapy.Selector(text=sample, type='html')
>>> selector.xpath('//script//text()').extract_first()
u'\n  $(document).ready(function(){\n    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);\n    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});\n  });\n'

然后，我们可以使用js2xml解析JavaScript代码.您会得到一个lxml树:

Then we can parse the JavaScript code with js2xml. You get an lxml tree back:

>>> import js2xml
>>> jssnippet = selector.xpath('//script//text()').extract_first()
>>> jstree = js2xml.parse(jssnippet)
>>> jstree
<Element program at 0x7fc7c6bae1b8>

树长什么样?这很冗长:

What does the tree look like? It's pretty verbose:

>>> print(js2xml.pretty_print(jstree))
<program>
  <functioncall>
    <function>
      <dotaccessor>
        <object>
          <functioncall>
            <function>
              <identifier name="$"/>
            </function>
            <arguments>
              <identifier name="document"/>
            </arguments>
          </functioncall>
        </object>
        <property>
          <identifier name="ready"/>
        </property>
      </dotaccessor>
    </function>
    <arguments>
      <funcexpr>
        <identifier/>
        <parameters/>
        <body>
          <var name="terms">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Collections"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="Terms"/>
                </property>
              </dotaccessor>
              <arguments>
                <array>
                  <object>
                    <property name="id">
                      <string>6436</string>
                    </property>
                    <property name="name">
                      <string>SUMMER 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                  <object>
                    <property name="id">
                      <string>6517</string>
                    </property>
                    <property name="name">
                      <string>FALL 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                </array>
              </arguments>
            </new>
          </var>
          <var name="view">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Views"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="CourseSelector"/>
                </property>
              </dotaccessor>
              <arguments>
                <object>
                  <property name="el">
                    <string>body</string>
                  </property>
                  <property name="terms">
                    <identifier name="terms"/>
                  </property>
                </object>
              </arguments>
            </new>
          </var>
        </body>
      </funcexpr>
    </arguments>
  </functioncall>
</program>

您可以使用XPath技能来指向JavaScript数组(您希望将new结构的点"访问器的第一个参数分配给var terms):

You can use your XPath skills to point to the JavaScript array (you want the 1st argument of the "dot" accessor for the new contruct assigned to var terms):

>>> jstree.xpath('//var[@name="terms"]')
[<Element var at 0x7fc7c565e638>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')
[<Element array at 0x7fc7c565e5a8>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')[0]
<Element array at 0x7fc7c565e5a8>

最后，现在您有了<array>元素，您可以将其传递给js2xml.jsonlike.make_dict()以获得一个可以使用的漂亮Python对象(make_dict有点叫错):

Finally, now that you have the <array> element, you can pass it to js2xml.jsonlike.make_dict() to get a nice Python object to work with (make_dict is kinda misnamed):

>>> js2xml.jsonlike.make_dict(jstree.xpath('//var[@name="terms"]/new/arguments/*')[0])
[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}]
>>>

注意:您还可以使用快捷方式js2xml.jsonlike.getall()来获取看起来像Python字典或列表的所有内容(您将获得2个列表，您对第一个列表感兴趣):

Note: you can also use the shortcut js2xml.jsonlike.getall() to fetch everything that looks like a Python dict or list (you get 2 lists, you're interested in the 1st one):

>>> js2xml.jsonlike.getall(jstree)
[[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}], {'el': 'body', 'terms': 'terms'}]

这篇关于Scrapy:从HTML脚本中提取JSON的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Scrapy:从HTML脚本中提取JSON [英] Scrapy: extract JSON from within HTML script

问题描述

推荐答案

相关文章

前端开发最新文章

热门教程

热门工具

登录关闭

Scrapy:从HTML脚本中提取JSON [英] Scrapy: extract JSON from within HTML script

问题描述

推荐答案

相关文章

前端开发最新文章

热门教程

热门工具

登录 关闭

登录关闭