Workflow for statistical analysis and report writing


Question

Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:

  1. Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.

  2. The analyst downloads some data, munges the data and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).

  3. The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (1).

  4. Rinse and repeat until the tables and graphics pass QA/QC and satisfy the client.

  5. Write report incorporating tables and graphics.

  6. Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data by a new download (e.g. get the building permits from the last year), and pressing a "RECALCULATE" button, unless specifications change.

At the moment, I just start a directory and ad-hoc it the best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ARCGIS, R, and Unix tools.

Thanks!

PS:

Below is a basic Makefile that checks for dependencies on various intermediate datasets (with a .RData suffix) and scripts (.R suffix). Make uses timestamps to check dependencies, so if you touch ss07por.csv, it will see that the file is newer than all the files/targets that depend on it and execute the given scripts to update them accordingly. This is still a work in progress, including a step for loading into a SQL database and a step for a templating language like Sweave. Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!

http://www.gnu.org/software/make/manual/html_node/index.html#Top

R=/home/wsprague/R-2.9.2/bin/R

# Import the raw source data; rerun if the CSV or the scripts change.
persondata.RData : ImportData.R ../../DATA/ss07por.csv Functions.R
	$R --slave -f ImportData.R

# Munge the imported data into analysis-ready form.
persondata.Munged.RData : MungeData.R persondata.RData Functions.R
	$R --slave -f MungeData.R

# Tabulate and graph, capturing the output as the report body.
report.txt : TabulateAndGraph.R persondata.Munged.RData Functions.R
	$R --slave -f TabulateAndGraph.R > report.txt
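
As a sketch of the Sweave step mentioned above (the report.Rnw target and filenames are hypothetical additions, not part of the original Makefile):

# Hypothetical templating step: weave report.Rnw into report.tex, then compile it.
report.pdf : report.Rnw persondata.Munged.RData Functions.R
	$R CMD Sweave report.Rnw
	pdflatex report.tex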

Solution

I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
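
For example, a minimal load.R along these lines (the file names and URL are hypothetical, for illustration only):

# load.R -- read in all raw inputs; no transformation happens here.
# (Hypothetical sources; adjust paths for the project at hand.)
persons <- read.csv("../DATA/ss07por.csv")             # local flat file
permits <- read.csv("http://example.org/permits.csv")  # data pulled from a URL
save(persons, permits, file = "raw.RData")             # snapshot for later steps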

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
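
A clean.R in that spirit might look like this (the column names are made up for illustration):

# clean.R -- all the ugly one-off fixes live here.
load("raw.RData")
persons$income[persons$income < 0] <- NA             # recode negative values as missing
persons <- merge(persons, permits, by = "district")  # merge the two data frames
persons <- subset(persons, population < 1e7)         # drop implausible outliers
save(persons, file = "clean.RData")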

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
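
For instance, a func.R could hold nothing but definitions (both functions here are illustrative, not from the answer):

# func.R -- function definitions only; source()'ing this has no side effects.
estimate_population <- function(df) {
  # Sum unit-level counts up to district totals.
  aggregate(population ~ district, data = df, FUN = sum)
}
plot_population <- function(est) {
  # Quick bar chart of the district estimates.
  barplot(est$population, names.arg = est$district, las = 2)
}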

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
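
And a matching do.R, a sketch under the same assumed names:

# do.R -- wire the pieces together and produce the outputs.
load("clean.RData")    # result of clean.R
source("func.R")       # pick up the analysis functions
est <- estimate_population(persons)
print(est)             # table for the report
pdf("population.pdf")  # chart for the report
plot_population(est)
dev.off()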

The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.
