Scrapy, scraping data inside a Javascript


Problem description

I am using scrapy to screen-scrape data from a website. However, the data I want isn't in the HTML itself; instead, it comes from JavaScript. So, my question is:

How do I get the (text) values in such cases?

This is the site I'm trying to screen-scrape: https://www.mcdonalds.com.sg/locate-us/

Attributes I'm trying to get: address, contact, operating hours.

If you do a right-click and "View source" in a Chrome browser, you will see that these values aren't available in the HTML itself.

Edit

Sorry Paul, I did what you told me to, found the admin-ajax.php and saw the body, but I'm really stuck now.

How do I retrieve the values from the JSON object and store them in a field of my own? It would be great if you could show how to do it for just one attribute, for the public and for those who are just starting out with Scrapy.
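As a minimal sketch of the JSON-to-field step, independent of Scrapy: `json.loads` turns the response body into plain Python dicts and lists, and you index into those to pull out a value. (The payload below is a made-up fragment that mirrors the `stores`/`listing` structure of the real response.)

```python
import json

# Made-up fragment mirroring the shape of the admin-ajax.php response.
raw = '{"stores": {"listing": [{"name": "McDonald\'s Admiralty", "address": "678A Woodlands Avenue 6"}]}}'

data = json.loads(raw)                    # parse the body into Python objects
first_store = data['stores']['listing'][0]
address = first_store['address']          # the value you would assign to item['address']
print(address)
```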

This is my current code:

items.py

from scrapy.item import Item, Field

class McDonaldsItem(Item):
    name = Field()
    address = Field()
    postal = Field()
    hours = Field()

mcdonalds.py

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import json
import pprint
import re

from fastfood.items import McDonaldsItem

class McDonaldSpider(BaseSpider):
    name = "mcdonalds"
    allowed_domains = ["mcdonalds.com.sg"]
    start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

    def parse_json(self, response):
        js = json.loads(response.body)
        pprint.pprint(js)

Sorry for the long edit. In short: how do I store the JSON value into my attribute? For example:

***item['address'] = * how to retrieve ****

P.S. Not sure if this helps, but I run these scripts on the command line using

scrapy crawl mcdonalds -o McDonalds.json -t json (to save all my data into a JSON file)

I can't stress enough how thankful I am. I know it's kind of unreasonable to ask this of you; it's totally okay even if you don't have time for this.

Answer

(I posted this to the scrapy-users mailing list, but at Paul's suggestion I'm posting it here, as it complements the answer with the shell-command interaction.)

Generally, websites that use a third-party service to render some data visualization (a map, a table, etc.) have to send the data somehow, and in most cases that data is accessible from the browser.

In this case, an inspection (i.e. exploring the requests made by the browser) shows that the data is loaded from a POST request to https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php.
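That POST carries ordinary form-encoded fields. As a small sketch of what actually goes over the wire (the field names are the ones observed in the browser's network inspector; `'0'` appears to mean "no filter"):

```python
from urllib.parse import urlencode

# Form fields observed in the network inspector for the admin-ajax.php request.
payload = {
    'action': 'ws_search_store_location',
    'store_name': '0',
    'store_area': '0',
    'store_type': '0',
}

body = urlencode(payload)  # the form-encoded POST body
print(body)
```

This is exactly the body that Scrapy's `FormRequest` builds for you from the `formdata` dict, as shown in the shell session below.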

So, basically, you have all the data you want right there, in a nice JSON format ready for consumption.

Scrapy provides the shell command, which is very convenient for tinkering with the website before writing the spider:

$ scrapy shell https://www.mcdonalds.com.sg/locate-us/
2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
...

In [1]: from scrapy.http import FormRequest

In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'

In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}

In [4]: req = FormRequest(url, formdata=payload)

In [5]: fetch(req)
2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)
...

In [6]: import json

In [7]: data = json.loads(response.body)

In [8]: len(data['stores']['listing'])
Out[8]: 127

In [9]: data['stores']['listing'][0]
Out[9]: 
{u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
 u'city': u'Singapore',
 u'id': 78,
 u'lat': u'1.440409',
 u'lon': u'103.801489',
 u'name': u"McDonald's Admiralty",
 u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
 u'phone': u'68940513',
 u'region': u'north',
 u'type': [u'24hrs', u'dessert_kiosk'],
 u'zip': u'731678'}

In short: in your spider you have to return the FormRequest(...) above, then in the callback load the JSON object from response.body, and finally, for each store's data in the list data['stores']['listing'], create an item with the wanted values.

Something like this:

import json

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

from fastfood.items import McDonaldsItem

class McDonaldSpider(BaseSpider):
    name = "mcdonalds"
    allowed_domains = ["mcdonalds.com.sg"]
    start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

    def parse(self, response):
        # This receives the response from the start url. But we don't do anything with it.
        url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
        payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}
        return FormRequest(url, formdata=payload, callback=self.parse_stores)

    def parse_stores(self, response):
        data = json.loads(response.body)
        for store in data['stores']['listing']:
            yield McDonaldsItem(name=store['name'], address=store['address'])
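The remaining item fields can be filled the same way: the store record shown earlier carries the postal code under 'zip' and the opening hours under 'op_hours'. Those string values embed `<br>`/`<br/>` tags and literal `\r\n` sequences, so a small cleanup helper is useful. A sketch (the `clean` helper is my own, not part of the answer):

```python
import re

def clean(value):
    # Replace <br>/<br/> tags with a comma separator, then collapse whitespace
    # (including the embedded \r\n sequences) into single spaces.
    value = re.sub(r'<br\s*/?>', ', ', value)
    return ' '.join(value.split())

# Values taken from the store record shown in the shell session above.
address = clean('678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678')
hours = clean('24 hours<br>\r\nDessert Kiosk: 0900-0100')
print(address)
print(hours)
```

In parse_stores you would then yield something like `McDonaldsItem(name=store['name'], address=clean(store['address']), postal=store['zip'], hours=clean(store['op_hours']))`.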
