Using scrapy to extract XHR request?

Problem description

I'm trying to scrape social "like" counts that are generated with JavaScript. I am able to scrape the desired data if I reference the XHR URL directly, but the site I am trying to scrape generates these XMLHttpRequests dynamically, with query string parameters that I do not know how to extract.

For example, you can see that the m, p, i, and g parameters, which are unique to each page, are used to construct the request URL.

Here is the combined URL:

http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829

...which returns this JSON:

{"twitter":13325,"facebook":23481,"googleplusone":964,"disqus":272}

Using the following script, I am able to extract the desired data (in this case the Twitter count) from the request URL I just mentioned, but only for that specific page.

import scrapy

from aeon.items import AeonItem
import json
from scrapy.http.request import Request

class AeonSpider(scrapy.Spider):
    name = "aeon"
    allowed_domains = ["aeon.co"]
    start_urls = [
        "http://aeon.co/magazine/technology"
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
            item = AeonItem()
            item['title'] = sel.xpath('./a/p[1]/text()').extract()
            item['primary_url'] = sel.xpath('./a/@href').extract()
            item['word_count'] = sel.xpath('./a/div/span[2]/text()').extract()

            for each in item['primary_url']:
                # The social.php URL is hard-coded here, so it only works for the
                # Mars article; see the question below.
                yield Request('http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829',
                              callback=self.parse_XHR_data, meta={'item': item})

    def parse_XHR_data(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item["tw_count"] = jsonresponse["twitter"]
        yield item

So my question is: how can I extract the m, p, i, and g URL query parameters so that I can build the request URL dynamically (rather than hard-coding it as shown above)?

Recommended answer

Here is how to extract the parameters from the URL:

import urlparse  # Python 2 module; in Python 3 the same functions live in urllib.parse

url = 'http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829'

# parse_qs returns a dict mapping each query parameter to a list of its values
parsed_url = urlparse.parse_qs(urlparse.urlparse(url).query)

for p in parsed_url:
    print p + '=' + parsed_url[p][0]

And the output:

>> python test.py
url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/
p=1412056831
m=1385983411
i=25829
g=http://aeon.co/magazine/?p=25829
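
That covers pulling the parameters back out of a URL you already have. To build the URL per article, the m, p, i, and g values still have to come from each article page itself; presumably they appear somewhere in its HTML (for example in the inline script that fires the XHR), but that is an assumption about the page source that would need to be checked in the browser's network/source view. Under that assumption, a rough sketch of wiring it into the spider: follow each article link first, scrape the parameters out of the article body with a hypothetical regex helper, then request social.php and reuse parse_XHR_data from the question.

import re

from aeon.items import AeonItem
from scrapy.http.request import Request

SOCIAL_URL = ('http://aeon.co/magazine/social/social.php'
              '?url=%(url)s&m=%(m)s&p=%(p)s&i=%(i)s&g=%(g)s')

# Hypothetical helper: assumes each value appears as 'm=...', 'p=...' etc.
# somewhere in the article HTML. The real pattern must be confirmed against
# the actual page source before relying on it.
def extract_social_params(body):
    params = {}
    for name in ('m', 'p', 'i', 'g'):
        match = re.search(r'[?&]%s=([^&"\s]+)' % name, body)
        if match:
            params[name] = match.group(1)
    return params

# Drop-in replacements for the methods inside AeonSpider above:
def parse(self, response):
    for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
        item = AeonItem()
        item['title'] = sel.xpath('./a/p[1]/text()').extract()
        item['primary_url'] = sel.xpath('./a/@href').extract()
        for url in item['primary_url']:
            # Visit the article page first so its parameters can be scraped.
            yield Request(url, callback=self.parse_article, meta={'item': item})

def parse_article(self, response):
    params = extract_social_params(response.body)
    if len(params) == 4:  # only proceed if m, p, i, and g were all found
        params['url'] = response.url
        yield Request(SOCIAL_URL % params, callback=self.parse_XHR_data,
                      meta={'item': response.meta['item']})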
