Workflow for statistical analysis and report writing


Problem description


Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:

  1. Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.

  2. The analyst downloads some data, munges the data and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).

  3. The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (1).

  4. Rinse repeat until the tables and graphics meet QA/QC and satisfy the client.

  5. Write report incorporating tables and graphics.

  6. Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data by a new download (e.g. get the building permits from the last year), and pressing a "RECALCULATE" button, unless specifications change.

At the moment, I just start a directory and ad-hoc it the best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ArcGIS, R, and Unix tools.

Thanks!

PS:

Below is a basic Makefile that checks for dependencies on various intermediate datasets (w/ .RData suffix) and scripts (.R suffix). Make uses timestamps to check dependencies, so if you touch ss07por.csv, it will see that this file is newer than all the files / targets that depend on it, and execute the given scripts in order to update them accordingly. This is still a work in progress, including a step for putting the data into a SQL database, and a step for a templating language like Sweave. Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!

http://www.gnu.org/software/make/manual/html_node/index.html#Top

# Path to the R binary used to run each step.
R=/home/wsprague/R-2.9.2/bin/R

# Each target is rebuilt whenever any of its prerequisites is newer than it.
persondata.RData: ImportData.R ../../DATA/ss07por.csv Functions.R
	$R --slave -f ImportData.R

persondata.Munged.RData: MungeData.R persondata.RData Functions.R
	$R --slave -f MungeData.R

report.txt: TabulateAndGraph.R persondata.Munged.RData Functions.R
	$R --slave -f TabulateAndGraph.R > report.txt

Solution

I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
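
A minimal sketch of what such a load.R might contain (the file names, object names, and columns here are hypothetical placeholders, not from the original answer):

# load.R -- read in all raw inputs; no transformation happens here.
permits <- read.csv("../DATA/building_permits.csv", stringsAsFactors = FALSE)
parcels <- read.csv("../DATA/parcels.csv", stringsAsFactors = FALSE)

# Persist the raw workspace so later steps can start from disk.
save(permits, parcels, file = "raw.RData")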

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
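
Sketched the same way (again with hypothetical names), clean.R might be:

# clean.R -- all the ugly munging in one place, starting from load.R's output.
load("raw.RData")

# Drop records missing the join key, then merge the two data frames.
permits <- permits[!is.na(permits$parcel_id), ]
dat <- merge(permits, parcels, by = "parcel_id")

# Cap an outlier-prone column at its 99th percentile.
cap <- quantile(dat$units, 0.99, na.rm = TRUE)
dat$units <- pmin(dat$units, cap)

save(dat, file = "clean.RData")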

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
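
For example, func.R might hold nothing but definitions like these (hypothetical functions for a population-report use case):

# func.R -- function definitions only; source()'ing this has no side effects.
pop_by_district <- function(dat) {
  # Sum population over districts; returns a data frame with columns
  # "district" and "x" (aggregate's default name for the summed value).
  aggregate(dat$population, by = list(district = dat$district), FUN = sum)
}

plot_pop <- function(tab) {
  barplot(tab$x, names.arg = tab$district, main = "Population by district")
}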

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
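
A matching sketch of do.R, calling the hypothetical functions above:

# do.R -- run the analysis and write out the tables and charts.
load("clean.RData")
source("func.R")

tab <- pop_by_district(dat)
write.csv(tab, "pop_by_district.csv", row.names = FALSE)

pdf("pop_by_district.pdf")
plot_pop(tab)
dev.off()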

The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.
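
Under this layout a full run is just four source() calls, and day-to-day iteration only repeats the cheap ones:

# Run these once per data refresh (slow):
source("load.R")
source("clean.R")

# Iterate on these without reloading the data (fast):
source("func.R")
source("do.R")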
