How do I extract a jsonObj out of JavaScript with Scrapy
Question
I want to build a dictionary from the jsonObj. Here's what I have so far. I haven't yet figured out how to extract the JSON in order to parse it.
def parse_store(self, response):
    jsonobj = response.xpath('//script[@window.appData//text').extract()
    stores = json.loads(jsonobj.body_as_unicode())
    print(stores)
    for stores in response:
        stores = {}
        stores['stores'] = response['stores']
        stores['stores']['id'] = response['stores']['id']
        stores['stores']['name'] = response['stores']['name']
        stores['stores']['addr1'] = response['stores']['addr1']
        stores['stores']['city'] = response['stores']['city']
        stores['stores']['state'] = response['stores']['state']
        stores['stores']['country'] = response['stores']['country']
        stores['stores']['zipCode'] = response['stores']['zipCode']
        stores['stores']['phone'] = response['stores']['phone']
        stores['stores']['latitude'] = response['stores']['latitude']
        stores['stores']['longitude'] = response['stores']['longitude']
        stores['stores']['services'] = response['stores']['services']
        print(stores)
    return stores
Recommended answer
One way to do this is to use js2xml (disclaimer: I wrote js2xml).
So let's assume you have a Scrapy Selector with a <script> element containing some JavaScript data:
>>> import scrapy
>>> html = '''<script>
... window.appData = {
... "stores": [
... { "id": "952",
... "name": "BAYTOWN TX",
... "addr1": "4620 garth rd",
... "city": "baytown",
... "state": "TX",
... "country": "US",
... "zipCode": "77521",
... "phone": "281-420-0079",
... "locationType": "Store",
... "locationSubType": "Big Box Store",
... "latitude": "29.77313",
... "longitude": "-94.97634"
... }]
... }
... </script>'''
>>> selector = scrapy.Selector(text=html, type="html")
Let's extract that JavaScript bit from it:
>>> js = selector.xpath('//script/text()').extract_first()
>>> js
u'\nwindow.appData = {\n "stores": [\n { "id": "952",\n "name": "BAYTOWN TX",\n "addr1": "4620 garth rd",\n "city": "baytown",\n "state": "TX",\n "country": "US",\n "zipCode": "77521",\n "phone": "281-420-0079",\n "locationType": "Store",\n "locationSubType": "Big Box Store",\n "latitude": "29.77313",\n "longitude": "-94.97634"\n }]\n}\n'
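As an aside, the object literal in this particular snippet also happens to be valid JSON. When that holds, and only then, a standard-library shortcut is possible. The sketch below is an assumption-laden alternative, not part of js2xml: the helper name extract_json_assignment and the shortened sample string are made up for illustration. It locates the assignment with a regular expression and lets json.JSONDecoder.raw_decode parse one JSON value from that position, ignoring any JavaScript that follows.

```python
import json
import re

# Shortened, hypothetical stand-in for the JavaScript text extracted above.
JS_SNIPPET = '''
window.appData = {
  "stores": [
    {"id": "952", "name": "BAYTOWN TX", "city": "baytown", "state": "TX"}
  ]
};
console.log("other JavaScript may follow");
'''


def extract_json_assignment(js_text, variable):
    """Return the JSON value assigned to `variable`, or None if not found.

    Only works when the right-hand side is strict JSON (double-quoted keys,
    no trailing commas, no comments); otherwise json raises JSONDecodeError.
    """
    match = re.search(re.escape(variable) + r"\s*=\s*", js_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # returns it together with the index where parsing stopped.
    value, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return value


app_data = extract_json_assignment(JS_SNIPPET, "window.appData")
print(app_data["stores"][0]["name"])
```

This shortcut fails as soon as the literal uses unquoted keys, single-quoted strings or trailing commas, which is where an actual JavaScript parser earns its keep.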
Now, import js2xml and call the .parse() function. You get an lxml tree back, representing the JavaScript code (roughly its AST):
>>> import js2xml
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7fc7f1ba3bd8>
If you're curious, here's what the tree looks like:
>>> print(js2xml.pretty_print(jstree))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="appData"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="stores">
          <array>
            <object>
              <property name="id">
                <string>952</string>
              </property>
              <property name="name">
                <string>BAYTOWN TX</string>
              </property>
              <property name="addr1">
                <string>4620 garth rd</string>
              </property>
              <property name="city">
                <string>baytown</string>
              </property>
              <property name="state">
                <string>TX</string>
              </property>
              <property name="country">
                <string>US</string>
              </property>
              <property name="zipCode">
                <string>77521</string>
              </property>
              <property name="phone">
                <string>281-420-0079</string>
              </property>
              <property name="locationType">
                <string>Store</string>
              </property>
              <property name="locationSubType">
                <string>Big Box Store</string>
              </property>
              <property name="latitude">
                <string>29.77313</string>
              </property>
              <property name="longitude">
                <string>-94.97634</string>
              </property>
            </object>
          </array>
        </property>
      </object>
    </right>
  </assign>
</program>
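The <object>, <property>, <array> and <string> nodes in this tree map closely onto Python dicts, lists and strings. To make that mapping concrete, here is a rough standard-library sketch that walks an abbreviated, hand-copied fragment of the tree with xml.etree.ElementTree. It is not js2xml's implementation; the function name to_python and the TREE_XML fragment are assumptions made for illustration, and only the node types seen in this example are handled.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-copied fragment of the <right> subtree printed above.
TREE_XML = """
<object>
  <property name="stores">
    <array>
      <object>
        <property name="id"><string>952</string></property>
        <property name="name"><string>BAYTOWN TX</string></property>
      </object>
    </array>
  </property>
</object>
"""


def to_python(node):
    """Convert an <object>/<array>/<string> node into a dict/list/str."""
    if node.tag == "object":
        # Each <property name="..."> wraps exactly one value node.
        return {prop.get("name"): to_python(prop[0])
                for prop in node.findall("property")}
    if node.tag == "array":
        return [to_python(child) for child in node]
    if node.tag == "string":
        return node.text or ""
    raise ValueError("unsupported node type: %s" % node.tag)


root = ET.fromstring(TREE_XML.strip())
converted = to_python(root)
print(converted)
```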
Then you want the right-hand side of the assignment to window.appData, which is a JavaScript object.
You can use a regular XPath call to select it:
>>> jstree.xpath('''
... //assign[left//identifier[@name="appData"]]
... /right
... /*
... ''')
[<Element object at 0x7fc7f257f5f0>]
>>>
(i.e. you want the <assign> node, filtered on its <left> part, and then the child of its <right> part, which is an <object>)
js2xml has helpers to convert <object> nodes into Python dicts and lists (we select the first result of the xpath() call with [0]):
>>> js2xml.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0])
>>> from pprint import pprint
>>> pprint(js2xml.jsonlike.make_dict(jstree.xpath('//assign[left//identifier[@name="appData"]]/right/*')[0]))
{'stores': [{'addr1': '4620 garth rd',
             'city': 'baytown',
             'country': 'US',
             'id': '952',
             'latitude': '29.77313',
             'locationSubType': 'Big Box Store',
             'locationType': 'Store',
             'longitude': '-94.97634',
             'name': 'BAYTOWN TX',
             'phone': '281-420-0079',
             'state': 'TX',
             'zipCode': '77521'}]}
>>>
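To get back to the goal in the question, one dictionary per store with a fixed set of fields, the resulting dict can be reshaped with plain Python. The sketch below hard-codes the dict shown above so it runs without js2xml; the names STORE_FIELDS and iter_stores are hypothetical, and the field list is taken from the question's code. Fields missing from the data, such as services in this sample, come back as None rather than raising KeyError.

```python
# The dict produced above, hard-coded here so the sketch is self-contained.
app_data = {
    'stores': [{
        'addr1': '4620 garth rd',
        'city': 'baytown',
        'country': 'US',
        'id': '952',
        'latitude': '29.77313',
        'locationSubType': 'Big Box Store',
        'locationType': 'Store',
        'longitude': '-94.97634',
        'name': 'BAYTOWN TX',
        'phone': '281-420-0079',
        'state': 'TX',
        'zipCode': '77521',
    }]
}

# Fields the question's parse_store() tries to copy.
STORE_FIELDS = (
    'id', 'name', 'addr1', 'city', 'state', 'country',
    'zipCode', 'phone', 'latitude', 'longitude', 'services',
)


def iter_stores(data):
    """Yield one flat dict per store, keeping only STORE_FIELDS."""
    for store in data.get('stores', []):
        # .get() returns None for fields absent from this store.
        yield {field: store.get(field) for field in STORE_FIELDS}


stores = list(iter_stores(app_data))
print(stores[0]['name'], stores[0]['zipCode'])
```

Inside a Scrapy callback, the same generator could presumably be used with a yield for each store dict instead of building a list.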