Scraping Google Analytics with Scrapy

Question

I have been trying to use Scrapy to get some data from Google Analytics, and despite being a complete Python newbie I have made some progress. I can now log in to Google Analytics with Scrapy, but I need to make an AJAX request to get the data I want. I have tried to replicate my browser's HTTP request headers with the code below, but it doesn't seem to work; my error log says:

ValueError: too many values to unpack

Could somebody help? I've been working on this for two days; I have the feeling that I'm very close, but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import  logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import json

class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'Email': 'Email'},

                    callback=self.log_password)]


    def log_password(self, response):
        return [FormRequest.from_response(response,
                    formdata={'Passwd': 'Password'},

                    callback=self.after_login)]

    def after_login(self, response):
      if "authentication failed" in response.body:
        self.log("Login failed", level=logging.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
      else:
        print("Login Successful!!")
        return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
               method='POST',
               headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                         'Galaxy-Ajax': 'true',
                         'Origin': 'https://analytics.google.com',
                         'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                         'User-Agent': 'My-user-agent',
                         'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
               callback=self.parse_tastypage, dont_filter=True)


    def parse_tastypage(self, response):
        response = json.loads(jsonResponse)

        inspect_response(response, self)
        yield item

Here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login
    callback=self.parse_tastypage, dont_filter=True)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__
    self.headers = Headers(headers or {}, encoding=encoding)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__
    super(Headers, self).__init__(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__
    self.update(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update
    super(CaselessDict, self).update(iseq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 75986,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),
 'log_count/DEBUG': 6,

Answer

Your error occurs because headers needs to be a dict, not a dict inside a list:

headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
         'Galaxy-Ajax': 'true',
         'Origin': 'https://analytics.google.com',
         'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
         'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
         },
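The cause can be reproduced outside Scrapy. Scrapy's `Headers` class iterates its input as `(key, value)` pairs: a dict supplies those pairs via `.items()`, but a list containing one dict makes the loop try to unpack the dict itself, and a dict with more than two keys cannot unpack into two names. A minimal sketch (the header values here are just illustrative):

```python
good = {'Galaxy-Ajax': 'true',
        'Origin': 'https://analytics.google.com',
        'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1'}

# A dict yields (key, value) pairs via .items(), so this works:
pairs = [(k, v) for k, v in good.items()]

# A list containing the dict makes the same loop attempt `k, v = good`,
# which fails because the dict has three keys:
error = None
try:
    [(k, v) for k, v in [good]]
except ValueError as exc:
    error = str(exc)
print(error)  # e.g. "too many values to unpack (expected 2)"
```

This is exactly the generator expression shown in the traceback (`iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)`), so passing the dict directly makes the error disappear.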

That will fix your current issue, but you will then get a 411 because you also need to specify the Content-Length. If you add what you want to pull, I will be able to show you how. You can see the output below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1)
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed
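HTTP 411 is "Length Required": the server refuses a POST that carries no Content-Length. One way around it is to send the request with an actual body (for example via `FormRequest`, which encodes `formdata` into the body); setting the header by hand means counting the bytes of the encoded body. A minimal sketch of that computation, using a placeholder payload since the real GA form fields are not shown in the question:

```python
# Placeholder payload; the real fields depend on the getPage endpoint.
body = 'param=value'

headers = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    # Content-Length counts bytes of the encoded body, not characters.
    'Content-Length': str(len(body.encode('utf-8'))),
}
print(headers['Content-Length'])  # '11'
```

The `body` and `headers` would then be passed to `scrapy.http.Request(url, method='POST', body=body, headers=headers, ...)`.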
