Using scrapy to extract XHR request?

Problem description

I'm trying to scrape social "like" counts that are generated with JavaScript. I am able to scrape the desired data if I reference the XHR URL directly, but the site I am trying to scrape generates these XMLHttpRequests dynamically, with query string parameters that I do not know how to extract.

For example, you can see that the m, p, i, and g parameters, which are unique to each page, are used to construct the request URL.

Here is the combined URL:

http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829

...which returns this JSON:

{"twitter":13325,"facebook":23481,"googleplusone":964,"disqus":272}

Using the following script, I am able to extract the desired data (in this case the Twitter count) from the request URL I just mentioned, but only for that specific page.

import scrapy

from aeon.items import AeonItem
import json
from scrapy.http.request import Request

class AeonSpider(scrapy.Spider):
    name = "aeon"
    allowed_domains = ["aeon.co"]
    start_urls = [
        "http://aeon.co/magazine/technology"
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
            item = AeonItem()
            item['title'] = sel.xpath('./a/p[1]/text()').extract()
            item['primary_url'] = sel.xpath('./a/@href').extract()
            item['word_count'] = sel.xpath('./a/div/span[2]/text()').extract()

            for each in item['primary_url']:
                # The social.php URL is hard-coded here, so it only works for the
                # Mars article; see the question below.
                yield Request('http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829',
                              callback=self.parse_XHR_data, meta={'item': item})

    def parse_XHR_data(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item["tw_count"] = jsonresponse["twitter"]
        yield item

So my question is: how can I extract the m, p, i, and g URL query parameters so that I can build the request URL dynamically (rather than hard-coding it as shown above)?

Recommended answer

Here is how to extract the parameters from the URL:

import urlparse  # Python 2 module; in Python 3 the same functions live in urllib.parse

url = 'http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829'

# parse_qs returns a dict mapping each query parameter to a list of its values
parsed_url = urlparse.parse_qs(urlparse.urlparse(url).query)

for p in parsed_url:
    print p + '=' + parsed_url[p][0]

And the output:

>> python test.py
url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/
p=1412056831
m=1385983411
i=25829
g=http://aeon.co/magazine/?p=25829
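
That covers pulling the parameters back out of a URL you already have. To build the URL per article, the m, p, i, and g values still have to come from each article page itself; presumably they appear somewhere in its HTML (for example in the inline script that fires the XHR), but that is an assumption about the page source that would need to be checked in the browser's network/source view. Under that assumption, a rough sketch of wiring it into the spider: follow each article link first, scrape the parameters out of the article body with a hypothetical regex helper, then request social.php and reuse parse_XHR_data from the question.

import re

from aeon.items import AeonItem
from scrapy.http.request import Request

SOCIAL_URL = ('http://aeon.co/magazine/social/social.php'
              '?url=%(url)s&m=%(m)s&p=%(p)s&i=%(i)s&g=%(g)s')

# Hypothetical helper: assumes each value appears as 'm=...', 'p=...' etc.
# somewhere in the article HTML. The real pattern must be confirmed against
# the actual page source before relying on it.
def extract_social_params(body):
    params = {}
    for name in ('m', 'p', 'i', 'g'):
        match = re.search(r'[?&]%s=([^&"\s]+)' % name, body)
        if match:
            params[name] = match.group(1)
    return params

# Drop-in replacements for the methods inside AeonSpider above:
def parse(self, response):
    for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
        item = AeonItem()
        item['title'] = sel.xpath('./a/p[1]/text()').extract()
        item['primary_url'] = sel.xpath('./a/@href').extract()
        for url in item['primary_url']:
            # Visit the article page first so its parameters can be scraped.
            yield Request(url, callback=self.parse_article, meta={'item': item})

def parse_article(self, response):
    params = extract_social_params(response.body)
    if len(params) == 4:  # only proceed if m, p, i, and g were all found
        params['url'] = response.url
        yield Request(SOCIAL_URL % params, callback=self.parse_XHR_data,
                      meta={'item': response.meta['item']})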
