Using scrapy to extract XHR request?
Question
I'm trying to scrape social "like" counts that are generated with JavaScript. I am able to scrape the desired data if I reference the XHR URL absolutely. But the site I am trying to scrape dynamically generates these XMLHttpRequests with query string parameters that I do not know how to extract.
For example, you can see that the m, p, i, and g parameters, unique to each page, are used to construct the request URL.
This is the assembled URL:

http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829

...which returns this JSON:
{"twitter":13325,"facebook":23481,"googleplusone":964,"disqus":272}
Using the following script, I am able to extract the desired data (in this case the twitter count) from the request URL I just mentioned, but only for that specific page.
import scrapy
import json
from scrapy.http.request import Request
from aeon.items import AeonItem


class AeonSpider(scrapy.Spider):
    name = "aeon"
    allowed_domains = ["aeon.co"]
    start_urls = [
        "http://aeon.co/magazine/technology"
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
            item = AeonItem()
            item['title'] = sel.xpath('./a/p[1]/text()').extract()
            item['primary_url'] = sel.xpath('./a/@href').extract()
            item['word_count'] = sel.xpath('./a/div/span[2]/text()').extract()
            for each in item['primary_url']:
                # Hard-coded request URL -- only valid for this one article
                yield Request('http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829',
                              callback=self.parse_XHR_data, meta={'item': item})

    def parse_XHR_data(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item["tw_count"] = jsonresponse["twitter"]
        yield item
So my question is: how can I extract the m, p, i, and g URL query parameters, so that I can build the request URL dynamically (rather than referencing it absolutely as shown above)?
Answer
This is how to extract the query parameters from the URL:
import urlparse

url = 'http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829'
parsed_url = urlparse.parse_qs(urlparse.urlparse(url).query)
for p in parsed_url:
    print p + '=' + parsed_url[p][0]
And the output:
>> python test.py
url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/
p=1412056831
m=1385983411
i=25829
g=http://aeon.co/magazine/?p=25829
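Note that the snippet above is Python 2, where the module is called urlparse. In Python 3 the same functions live in urllib.parse, so an equivalent version would be:

```python
from urllib.parse import parse_qs, urlparse

url = ('http://aeon.co/magazine/social/social.php'
       '?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/'
       '&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829')

# parse_qs maps each query-string parameter to a list of its values
parsed_url = parse_qs(urlparse(url).query)
for p in parsed_url:
    print(p + '=' + parsed_url[p][0])
```

parse_qs splits the query on '&' only, so the embedded '?p=25829' inside the g value stays part of g rather than becoming a second p parameter.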
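To go the rest of the way, the per-page m, p, i, and g values have to be read out of each article page before the social.php request can be built. The following is a hedged sketch, not a verified solution: it assumes (without confirming against aeon.co's actual markup) that the social.php URL with its parameters appears somewhere in each page's HTML or inline JavaScript, so a regex can lift the values out and urlencode can rebuild the request URL per page:

```python
import re
from urllib.parse import urlencode

SOCIAL_ENDPOINT = 'http://aeon.co/magazine/social/social.php'

def build_social_url(page_html, page_url):
    """Hypothetical helper: lift the per-page m, p, i and g values out of
    an article page's source and rebuild the social.php request URL.
    The regex assumes each value shows up as a query parameter
    (e.g. '&m=1385983411') somewhere in the HTML; adjust it to the
    markup you actually see."""
    params = {'url': page_url}
    for key in ('m', 'p', 'i', 'g'):
        match = re.search(r"[?&]%s=([^&'\"<>\s]+)" % key, page_html)
        if match:
            params[key] = match.group(1)
    return SOCIAL_ENDPOINT + '?' + urlencode(params)

# Fabricated fragment standing in for a real page's source:
fragment = "social.php?url=X&m=1385983411&p=1412056831&i=25829&g=Z"
print(build_social_url(fragment, 'http://aeon.co/magazine/technology/example/'))
```

Inside the spider, parse would then yield Request(build_social_url(response.body_as_unicode(), response.url), callback=self.parse_XHR_data, ...) instead of the hard-coded URL.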