Workflow for statistical analysis and report writing


Question

Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this:

  1. Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district.

  2. The analyst downloads some data, munges the data and saves the result (e.g. adding a column for population per unit, or subsetting the data based on district boundaries).

  3. The analyst analyzes the data created in (2), gets close to her goal, but sees that she needs more data and so goes back to (1).

  4. Rinse and repeat until the tables and graphics meet QA/QC and satisfy the client.

  5. Write report incorporating tables and graphics.

  6. Next year, the happy client comes back and wants an update. This should be as simple as updating the upstream data by a new download (e.g. get the building permits from the last year), and pressing a "RECALCULATE" button, unless specifications change.

At the moment, I just start a directory and ad-hoc it as best I can. I would like a more systematic approach, so I am hoping someone has figured this out... I use a mix of spreadsheets, SQL, ArcGIS, R, and Unix tools.

Thanks!

PS:

Below is a basic Makefile that checks for dependencies on various intermediate datasets (w/ .RData suffix) and scripts (.R suffix). Make uses timestamps to check dependencies, so if you touch ss07por.csv, it will see that this file is newer than all the files / targets that depend on it, and execute the given scripts in order to update them accordingly. This is still a work in progress, including a step for loading the data into a SQL database, and a step for a templating language like Sweave. Note that Make relies on tabs in its syntax, so read the manual before cutting and pasting. Enjoy and give feedback!

http://www.gnu.org/software/make/manual/html_node/index.html#Top

R=/home/wsprague/R-2.9.2/bin/R

persondata.RData : ImportData.R ../../DATA/ss07por.csv Functions.R
	$R --slave -f ImportData.R

persondata.Munged.RData : MungeData.R persondata.RData Functions.R
	$R --slave -f MungeData.R

report.txt : TabulateAndGraph.R persondata.Munged.RData Functions.R
	$R --slave -f TabulateAndGraph.R > report.txt
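
The Sweave step mentioned above is not wired in yet, but it might eventually look something like the following target; this is only a sketch, and report.Rnw is a hypothetical Sweave document that would replace the plain report.txt (again, the recipe lines must be indented with tabs):

report.pdf : report.Rnw persondata.Munged.RData Functions.R
	$R CMD Sweave report.Rnw
	pdflatex report.tex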

Solution

I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.
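
For example, a minimal load.R might look like the following sketch; the URL, the ODBC data source name, and the table name are made up for illustration (the CSV path is the one from the Makefile above):

# load.R -- read in all the raw data; rerun only when the data refreshes
library(RODBC)

persons <- read.csv("../../DATA/ss07por.csv")             # local flat file
permits <- read.csv("http://example.org/permits.csv")     # data pulled from a URL

ch <- odbcConnect("district_db")                          # hypothetical DSN
parcels <- sqlQuery(ch, "SELECT * FROM parcels")
odbcClose(ch)

save(persons, permits, parcels, file = "raw.RData")       # checkpoint for clean.R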

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
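
A corresponding clean.R sketch, with made-up column names, might be:

# clean.R -- munge the raw data into an analysis-ready table
load("raw.RData")                                          # output of load.R

persons$pop_per_unit <- persons$population / persons$units # derived column
persons <- persons[!is.na(persons$pop_per_unit), ]         # drop missing values

merged <- merge(persons, parcels, by = "district_id")      # combine sources
cutoff <- quantile(merged$pop_per_unit, 0.99)
merged <- merged[merged$pop_per_unit < cutoff, ]           # trim extreme outliers

save(merged, file = "cleaned.RData")                       # checkpoint for do.R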

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
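
So func.R contains nothing but definitions; both functions below are illustrative:

# func.R -- function definitions only; source()'ing this has no side effects
estimatePopulation <- function(d) {
  # hypothetical estimator: units * persons-per-unit, summed within district
  tapply(d$units * d$pop_per_unit, d$district_id, sum)
}

plotEstimates <- function(est) {
  barplot(est, las = 2, main = "Population estimate by district")
}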

do.R: Calls the functions defined in func.R to perform the analysis and produce the charts and tables.
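
A do.R tying it all together might be as short as this (the output file names are arbitrary):

# do.R -- run the analysis and write out the tables and charts for the report
load("cleaned.RData")                 # output of clean.R
source("func.R")

est <- estimatePopulation(merged)

write.csv(est, "estimates.csv")       # table for the report

pdf("estimates.pdf")                  # chart for the report
plotEstimates(est)
dev.off()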

The main motivation for this setup is working with large data, where you don't want to have to reload the data each time you change a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.

