How to extract data from Google Analytics and build a data warehouse (webhouse) from it?


Problem description


    I have click stream data such as referring URLs, top landing pages, top exit pages, and metrics such as page views, number of visits, and bounces, all in Google Analytics. There is no database yet where all this information might be stored. I am required to build a data warehouse from scratch (which I believe is known as a web-house) from this data. So I need to extract data from Google Analytics and load it into a warehouse on a daily, automated basis. My questions are:

    1) Is it possible? The data grows every day (some in terms of metrics or measures such as visits, and some in terms of new referring sites), so how would the process of loading the warehouse work?

    2) What ETL tool would help me achieve this? I believe Pentaho has a way to pull data out of Google Analytics; has anyone used it? How does that process go? Any references or links would be appreciated besides answers.

    Solution

    As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.

    There are essentially two options, depending on how you retrieve the data. The first, already mentioned in a prior answer to this question, is to access your GA data via the GA API. This gives you data in a form close to what appears in the GA Report, that is, report-level rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data out of the XML container is about all that's needed.

    The second option involves grabbing the data much closer to the source.

    None of this is complicated, but a few lines of background are helpful here.

    • The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account).

    • Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.

    • Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from that _trackPageview function call plus data from the client DOM, the GA cookies set for this client, and the contents of the browser's location bar (http://www....).

    • Though this Request comes from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).

    So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API.
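Because each log line is one transaction, the daily load the question asks about can be an append into a hit-level fact table. The following is only a minimal sketch under stated assumptions: it uses SQLite from the Python standard library as a stand-in warehouse, the table and column names (fact_hit, request_id, and so on) are hypothetical, and it assumes the per-request id is unique within a day so that re-running a day's load does not duplicate rows.

```python
import sqlite3


def load_daily_hits(conn, hit_date, hits):
    """Append one day's hits and return the row count stored for that day.

    The table layout is hypothetical; a real warehouse would carry many
    more columns and dimension tables.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS fact_hit (
               hit_date     TEXT NOT NULL,
               request_id   TEXT NOT NULL,
               page_request TEXT,
               referral_url TEXT,
               PRIMARY KEY (hit_date, request_id)
           )"""
    )
    rows = [
        (hit_date, h["request_id"], h.get("page_request"), h.get("referral_url"))
        for h in hits
    ]
    # "with conn" commits on success and rolls back on error.
    with conn:
        # INSERT OR IGNORE skips rows whose key already exists, so loading
        # the same day twice leaves the table unchanged.
        conn.executemany("INSERT OR IGNORE INTO fact_hit VALUES (?, ?, ?, ?)", rows)
    count = conn.execute(
        "SELECT COUNT(*) FROM fact_hit WHERE hit_date = ?", (hit_date,)
    ).fetchone()[0]
    return count
```

Keying the table on the date plus the request id is one way to make the nightly job safe to re-run; a scheduler such as cron can then call it once per day with the previous day's parsed log lines.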

    The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:

    1. Modify the GA tracking code on each page of your Site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:

      pageTracker._setLocalRemoteServerMode();
      

    2. Next, just put a single-pixel gif image in your document root and call it "__utm.gif".

    Your server activity log will now contain these individual transaction lines, each built from the string appended to the HTTP Request for the GA tracking pixel plus other data in the Request (e.g., the User Agent string). The appended string is just a concatenation of key-value pairs in which each key begins with the letters "utm" (probably for "Urchin tracker"). Not every utm parameter appears in every GIF Request; several of them, for instance, are used only for e-commerce transactions, so the set depends on the transaction.
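Pulling those lines back out of the server log might look like the sketch below. It assumes the web server writes the Apache "combined" log format, which the answer does not specify, and the sample line is made up for illustration (it is not taken from a real log). The function name extract_hit is likewise hypothetical.

```python
import re

# Assumption: Apache "combined" log format. Adjust the pattern if your
# server is configured with a different LogFormat.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Made-up example line, for illustration only.
SAMPLE_LINE = (
    '203.0.113.7 - - [19/May/2010:08:00:51 +0000] '
    '"GET /__utm.gif?utmwv=1&utmn=1669045322&utmhn=lindenlab.hrmdirect.com HTTP/1.1" '
    '200 35 "http://lindenlab.hrmdirect.com/" "Mozilla/5.0"'
)


def extract_hit(line):
    """Return the tracking data for a __utm.gif request, or None otherwise."""
    match = COMBINED.match(line)
    if match is None:
        return None
    path, _, query = match.group("target").partition("?")
    if not path.endswith("/__utm.gif"):
        return None  # an ordinary page or asset request, not a tracking hit
    return {
        "time": match.group("time"),
        "ip": match.group("ip"),
        "user_agent": match.group("user_agent"),
        "query": query,  # the utm key-value string appended to the request
    }
```

Filtering on the request path keeps ordinary page and asset requests out of the warehouse feed, and the User Agent string travels along with the utm string, as the paragraph above describes.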

    Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):

    http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B

    As you can see, this string consists of a set of key-value pairs separated by "&". Two trivial steps make it much easier to read: (i) split the string on the ampersand; and (ii) replace each GIF parameter (key) with a short descriptive phrase:

    gatc_version            1
    GIF_req_unique_id       1669045322
    language_encoding       UTF-8
    screen_resolution       1280x800
    screen_color_depth      24-bit
    browser_language        en-us
    java_enabled            1
    flash_version           10.0%20r45
    campaign_session_new    1
    page_title              Position%20Listings%20%7C%20Linden%20Lab
    host_name               lindenlab.hrmdirect.com
    referral_url            http://lindenlab.com/employment
    page_request            /employment/openings.php?sort=da
    account_string          UA-XXXXXX-X
    cookies                 __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
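The two steps (split on the ampersand, then rename the keys) can be sketched as follows. The key-to-name mapping is read off the example request and the listing above, so it covers only the parameters that appear there; the function name is hypothetical, and the request used here is a shortened copy of the one shown. Values are left percent-encoded, as in the listing.

```python
from urllib.parse import urlsplit

# Descriptive names for the parameters seen in the example request.
PARAM_NAMES = {
    "utmwv": "gatc_version",
    "utmn": "GIF_req_unique_id",
    "utmcs": "language_encoding",
    "utmsr": "screen_resolution",
    "utmsc": "screen_color_depth",
    "utmul": "browser_language",
    "utmje": "java_enabled",
    "utmfl": "flash_version",
    "utmcn": "campaign_session_new",
    "utmdt": "page_title",
    "utmhn": "host_name",
    "utmr": "referral_url",
    "utmp": "page_request",
    "utmac": "account_string",
    "utmcc": "cookies",
}

# Shortened copy of the GIF Request above (a subset of its parameters).
GIF_REQUEST = (
    "http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322"
    "&utmsr=1280x800&utmfl=10.0%20r45"
    "&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X"
)


def parse_gif_request(url):
    """Split the query string on '&' and rename each utm key."""
    fields = {}
    for pair in urlsplit(url).query.split("&"):
        if not pair:
            continue  # the example contains an empty pair ("&&")
        key, _, value = pair.partition("=")  # split on the first '=' only
        fields[PARAM_NAMES.get(key, key)] = value  # unknown keys pass through
    return fields
```

Splitting by hand rather than with a generic query-string parser keeps the values exactly as logged, including the unescaped "?" inside the page request; apply urllib.parse.unquote to individual fields if decoded text is wanted.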

    The cookies are also simple to parse (see Google's concise description of them); for instance:

    • __utma is the unique-visitor cookie,

    • __utmb, __utmc are session cookies, and

    • __utmz is the referral type.

    The GA cookies store most of the data recorded for each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, a subsequent visit the next day, etc.). For instance, the __utma cookie consists of groups of integers separated by "."; the last group is the visit count for that user (a "1" in this case).
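A minimal sketch of that cookie parsing, using the cookie string from the example request. It assumes the cookies are joined by an encoded ";+" as they are in that example; the function names are hypothetical, and only the last __utma group is interpreted, since that is the one field described above.

```python
from urllib.parse import unquote

# The utmcc value from the example GIF Request.
UTMCC = (
    "__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B"
    "__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B"
    "__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com"
    "%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B"
)


def parse_utmcc(raw):
    """Decode the utmcc string and return a {cookie_name: value} dict."""
    cookies = {}
    for chunk in unquote(raw).split(";"):
        chunk = chunk.strip("+ ")  # drop the '+' separator left after decoding
        if not chunk:
            continue  # trailing separator at the end of the string
        name, _, value = chunk.partition("=")  # __utmz values contain '=' too
        cookies[name] = value
    return cookies


def utma_visit_count(cookies):
    """Return the last '.'-separated group of __utma, the visit count."""
    return int(cookies["__utma"].split(".")[-1])
```

Partitioning on the first "=" matters because the __utmz value itself contains "key=value" segments separated by "|", which would be cut short by a naive split.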
