Scrapy: scraping data inside JavaScript
Problem description
I am using Scrapy to screen-scrape data from a website. However, the data I want isn't inside the HTML itself; instead, it comes from JavaScript. So, my question is:
How do I get the values (text values) in such cases?
This is the site I'm trying to screen-scrape: https://www.mcdonalds.com.sg/locate-us/
Attributes I'm trying to get: address, contact, operating hours.
If you do a "right click", "view source" in a Chrome browser, you will see that these values aren't available in the HTML itself.
Edit
Sorry Paul, I did what you told me to, found the admin-ajax.php request and saw the body, but I'm really stuck now.
How do I retrieve the values from the JSON object and store them in a variable field of my own? It would be good if you could share how to do just one attribute, for the public and for those who have just started with Scrapy as well.
Here is my current code:
Items.py
    from scrapy.item import Item, Field

    class McDonaldsItem(Item):
        name = Field()
        address = Field()
        postal = Field()
        hours = Field()
mcdonalds.py
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import json
    import pprint
    import re
    from fastfood.items import McDonaldsItem

    class McDonaldSpider(BaseSpider):
        name = "mcdonalds"
        allowed_domains = ["mcdonalds.com.sg"]
        start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

        def parse_json(self, response):
            js = json.loads(response.body)
            pprint.pprint(js)
Sorry for the long edit; in short, how do I store the JSON value into my attribute? For example:
    item['address'] = * how to retrieve *
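(For readers following along: the pattern being asked about boils down to json.loads plus dict-style assignment. A minimal sketch, using a plain dict in place of a Scrapy Item, which supports the same item['field'] assignment; the sample body here is invented to mimic the shape of the real response:)

```python
import json

# Hypothetical response body, shaped like the admin-ajax.php reply
body = '{"stores": {"listing": [{"name": "McDonald\'s Admiralty", "address": "678A Woodlands Avenue 6"}]}}'

js = json.loads(body)
store = js['stores']['listing'][0]

item = {}  # stand-in for a Scrapy Item, which also supports [] assignment
item['address'] = store['address']
print(item['address'])
```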
P.S. Not sure if this helps, but I run these scripts on the command line using

    scrapy crawl mcdonalds -o McDonalds.json -t json

(to save all my data into a JSON file).
I cannot stress enough how thankful I feel. I know it's kind of unreasonable to ask this of you; it's totally okay even if you don't have time for this.
Recommended answer
(I posted this to the scrapy-users mailing list, but on Paul's suggestion I'm posting it here, as it complements the answer with the shell command interaction.)
Generally, websites that use a third-party service to render some data visualization (map, table, etc.) have to send the data somehow, and in most cases this data is accessible from the browser.
For this case, an inspection (i.e. exploring the requests made by the browser) shows that the data is loaded from a POST request to https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php
So, basically, you have all the data you want there, in a nice JSON format ready for consumption.
Scrapy provides the shell command, which is very convenient for tinkering with the website before writing the spider:
    $ scrapy shell https://www.mcdonalds.com.sg/locate-us/
    2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
    ...

    In [1]: from scrapy.http import FormRequest

    In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'

    In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}

    In [4]: req = FormRequest(url, formdata=payload)

    In [5]: fetch(req)
    2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)
    ...

    In [6]: import json

    In [7]: data = json.loads(response.body)

    In [8]: len(data['stores']['listing'])
    Out[8]: 127

    In [9]: data['stores']['listing'][0]
    Out[9]:
    {u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
     u'city': u'Singapore',
     u'id': 78,
     u'lat': u'1.440409',
     u'lon': u'103.801489',
     u'name': u"McDonald's Admiralty",
     u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
     u'phone': u'68940513',
     u'region': u'north',
     u'type': [u'24hrs', u'dessert_kiosk'],
     u'zip': u'731678'}
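Note that some fields embed HTML line breaks (`<br/>`, `<br>`) and CRLF sequences, as in the address and op_hours values above. If you want plain single-line text in your items, a small helper can flatten those first; this is an optional sketch, assuming comma-separated output is what you want:

```python
import re

def clean_field(value):
    """Turn <br>/<br/> tags into ', ' and fold CRLFs into one clean line."""
    value = re.sub(r'<br\s*/?>', ', ', value)   # both <br> and <br/> variants
    value = value.replace('\r\n', ' ').strip()
    return re.sub(r'\s+', ' ', value)           # collapse leftover whitespace runs

print(clean_field('678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678'))
```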
In short: in your spider you have to return the FormRequest(...) above, then in the callback load the JSON object from response.body, and finally, for each store's data in the list data['stores']['listing'], create an item with the wanted values.
Like this:
    import json

    from scrapy.http import FormRequest
    from scrapy.spider import BaseSpider

    from fastfood.items import McDonaldsItem

    class McDonaldSpider(BaseSpider):
        name = "mcdonalds"
        allowed_domains = ["mcdonalds.com.sg"]
        start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

        def parse(self, response):
            # This receives the response from the start url, but we don't do anything with it.
            url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
            payload = {'action': 'ws_search_store_location', 'store_name': '0', 'store_area': '0', 'store_type': '0'}
            return FormRequest(url, formdata=payload, callback=self.parse_stores)

        def parse_stores(self, response):
            data = json.loads(response.body)
            for store in data['stores']['listing']:
                yield McDonaldsItem(name=store['name'], address=store['address'])
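The spider above only fills name and address; the McDonaldsItem in Items.py also declares postal and hours, which map onto the zip and op_hours keys seen in the sample response earlier. A sketch of that mapping, using a plain dict in place of the Item class so it runs without Scrapy installed:

```python
def store_to_item(store):
    """Map one entry of data['stores']['listing'] onto the McDonaldsItem fields.

    Returns a plain dict here so the sketch is self-contained; in the spider
    you would yield McDonaldsItem(**store_to_item(store)) instead.
    """
    return {
        'name': store['name'],
        'address': store['address'],
        'postal': store['zip'],      # key names taken from the sample response above
        'hours': store['op_hours'],
    }

# Abbreviated copy of the first store from the shell session above
sample = {'name': "McDonald's Admiralty",
          'address': '678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
          'zip': '731678',
          'op_hours': '24 hours<br>\r\nDessert Kiosk: 0900-0100'}
print(store_to_item(sample)['postal'])
```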