Splitting data and fitting distributions efficiently


Question

For a project, I have received a large amount of confidential patient-level data. I need to fit distributions to it so that I can use them in a simulation model. I am using R.

The problem is that I need the shape/rate parameters for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how each variable is distributed), but I want to be able to set up a function or loop for each variable and generate the shape and rate parameters for each subset I define.

An example: I need length-of-stay data for subsets of patients, and there are 48 such subsets. So far I have been manually filtering the data, extracting each subset into a vector, and then fitting a distribution to that vector using fitdist.

For example, for a variable that is gamma distributed:

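library(dplyr)         # filter(), pull(), %>%
library(fitdistrplus)  # fitdist()

vector1 <- los_data %>%
  filter(group == 1, setting == 1, diagnosis == 1) %>%
  pull(stay.length)  # fitdist() needs a numeric vector; column name assumed

fitdist(vector1, "gamma")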

I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I assume it has something to do with a matrix, but I have no idea how best to proceed.

Recommended answer

One common practice is to split the data using split and then apply the function of interest to each group. Let's assume we have four columns: group, setting, diagnosis and stay.length. The first three each have two levels.

df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting  = sample(1:2, 64, TRUE),
  diagnosis  = sample(1:2, 64, TRUE), 
  stay.length = sample(1:5, 64, TRUE)
)
> head(df)
  group setting diagnosis stay.length
1     1       1         1           4
2     1       1         2           5
3     1       1         2           4
4     2       1         2           3
5     1       2         2           3
6     1       1         2           5

Run split and you get a list with one element per combination of group, setting and diagnosis:

dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))

> head(dfl)
$`1.1.1`
[1] 5 3 4 1 4 5 4 2 1

$`2.1.1`
[1] 5 4 5 4 3 1 5 3 1

$`1.2.1`
[1] 4 2 5 4 5 3 5 3

$`2.2.1`
[1] 2 1 4 3 5 4 4

$`1.1.2`
[1] 5 4 4 4 3 2 4 4 5 1 5 5

$`2.1.2`
[1] 5 4 4 5 3 2 4 5 1 2    
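
Before fitting anything, it can help to check how many observations fall into each combination, because split by default also keeps combinations that never occur in the data. The following is a minimal base R sketch, not part of the original answer. The data frame is simulated and its column names simply mirror the example above.

```r
set.seed(1)  # simulated data, for reproducibility only

df <- data.frame(
  group       = sample(1:2, 64, TRUE),
  setting     = sample(1:2, 64, TRUE),
  diagnosis   = sample(1:2, 64, TRUE),
  stay.length = sample(1:5, 64, TRUE)
)

# drop = TRUE discards combinations that have no rows
dfl <- split(df$stay.length,
             list(df$group, df$setting, df$diagnosis),
             drop = TRUE)

# number of observations in each subset (named integer vector)
group_sizes <- lengths(dfl)
print(group_sizes)
```

Very small subsets are worth spotting early, since a maximum likelihood fit on a handful of values tends to be unstable.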

Afterwards, we can use lapply to apply any function to each group in the list. For example, we can apply mean:

dflm <- lapply(dfl, mean)
> dflm
$`1.1.1`
[1] 3.222222

.
.
.
.

$`2.2.2`
[1] 2.8
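
If a plain named vector is more convenient than a list, vapply can do the same job as lapply while also checking that every result is a single number. A small sketch, using a hypothetical toy list as a stand-in for dfl:

```r
# toy stand-in for the list produced by split(); values are made up
dfl <- list(`1.1.1` = c(5, 3, 4), `2.1.1` = c(1, 2, 3, 6))

# vapply returns a named numeric vector instead of a list
dflm <- vapply(dfl, mean, numeric(1))
print(dflm)
```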

In your case, you can apply fitdist (or any other function) in the same way.

library(fitdistrplus)  # provides fitdist()
dfl.fitdist <- lapply(dfl, function(x) fitdist(x, "gamma"))

> dfl.fitdist
$`1.1.1`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape  3.38170  2.2831073
rate   1.04056  0.7573495

.
.
.


$`2.2.2`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape 4.868843  2.5184018
rate  1.549188  0.8441106
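
Since the goal is a set of shape and rate values for a simulation model, the fitted objects can be collapsed into one table. The sketch below is an illustration rather than part of the original answer. It assumes the fitdistrplus package is installed, uses simulated gamma data in place of the confidential data, and relies on each fitdist result storing its estimates in the estimate element. The names fits and params are made up for the example.

```r
library(fitdistrplus)  # provides fitdist()

set.seed(42)  # simulated data, for reproducibility only
n <- 800
df <- data.frame(
  group       = sample(1:2, n, TRUE),
  setting     = sample(1:2, n, TRUE),
  diagnosis   = sample(1:2, n, TRUE),
  stay.length = rgamma(n, shape = 2, rate = 0.5)
)

# one numeric vector per group/setting/diagnosis combination
dfl <- split(df$stay.length,
             list(df$group, df$setting, df$diagnosis),
             drop = TRUE)

# fit a gamma distribution to every subset
fits <- lapply(dfl, function(x) fitdist(x, "gamma"))

# stack the named estimate vectors (shape, rate) into one matrix,
# then turn it into a data frame with the subset label as a column
params <- do.call(rbind, lapply(fits, function(f) f$estimate))
params <- data.frame(subset = rownames(params), params, row.names = NULL)
print(params)
```

For variables that follow a different distribution, the same pattern should apply with a different distribution name passed to fitdist, although the estimate columns will then carry that distribution's parameter names.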
