Scraping from JavaScript using Scrapy
Question
I need to scrape the content of a JavaScript script tag using Scrapy. The tag looks like this:
<script type='text/javascript' id='script-id'> attribute={"pid":"123","title":"abc","url":"http://example.com","date":"2014-07-31 14:56:39 CDT","channels":["test"],"tags":[],"authors":["james Catcher"]};</script>
I can extract the content using XPath:
response.xpath('id("script-id")//text()').extract()
Output:
[u'\nattribute = {"pid":"123","title":"abc","url":"http:/example.com","date":"2014-07-30 15:34:10 ","channels":["test"],"tags":[],"authors":["james Watt"]};\n(function( ){\n var s = document.createElement(\'script\');\n s.async = true;\n s.type = \'text/javascript\';\n s.src = document.location.protocol + \'//d8rk54i4mohrb.cloudfront.net/js/reach.js\';\n (document.getElementsByTagName(\'head\')[0] || document.getElementsByTagName(\'body\')[0]).appendChild(s);\n})();\n']
How can I get each value using XPath?
Recommended answer
The value assigned to attribute is JSON, so you can first cut it out of the extracted string and then load it with the json module:
In [1]: import json
In [2]: sample_string = [u'\n attribute={"pid":"123","title":"abc",'
+'"url":"http:/example.com","date":"2014-07-30 15:34:10 ",'
+'"channels":["test"],"tags":[],"authors":["james Watt"]}'][0]
In [3]: data = json.loads(sample_string[12:])
In [4]: data
Out[4]:
{u'authors': [u'james Watt'],
u'channels': [u'test'],
u'date': u'2014-07-30 15:34:10 ',
u'pid': u'123',
u'tags': [],
u'title': u'abc',
u'url': u'http:/example.com'}
In [5]: data['authors']
Out[5]: [u'james Watt']
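The fixed slice [12:] in the session above only works when the text before the opening brace is exactly 12 characters long and nothing follows the closing brace. A slightly more tolerant variant is sketched below, assuming the script text always contains an assignment of the form attribute = {...}. The helper name extract_attribute and its var_name parameter are hypothetical, and the sample string is a shortened copy of the output shown in the question.

```python
import json
import re


def extract_attribute(script_text, var_name="attribute"):
    """Return the JSON object assigned to var_name inside script_text."""
    # Locate "attribute =" and position ourselves on the first character
    # of the value (the opening brace), allowing optional whitespace.
    match = re.search(r"\b%s\s*=\s*" % re.escape(var_name), script_text)
    if match is None:
        raise ValueError("no assignment to %r found" % var_name)
    # raw_decode parses a single JSON value starting at the given index
    # and ignores whatever comes after it (here the ';' and the
    # self-invoking loader function).
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


# Shortened copy of the text returned by the XPath query in the question.
script_text = (
    '\nattribute = {"pid":"123","title":"abc","url":"http:/example.com",'
    '"date":"2014-07-30 15:34:10 ","channels":["test"],"tags":[],'
    '"authors":["james Watt"]};\n(function( ){\n var s = 1;\n})();\n'
)

data = extract_attribute(script_text)
print(data["pid"])
print(data["authors"])
```

Inside a Scrapy callback, script_text would be the first element of the list returned by response.xpath('id("script-id")//text()').extract(). This only works while the assigned value is strict JSON; a JavaScript object literal with unquoted keys or single-quoted strings would make json raise an error.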
Alternatively, you can load a JavaScript engine such as PyV8 to interpret those variables.