Survival probability based on continuous variable in R | Titanic dataset
Problem description
Following is the Titanic data set, in which I am trying to find the conditional probability of survival based on sex and fare. Here sex is a categorical variable and fare is a continuous variable.
library(PASWR2)
library(magrittr)
library(data.table)
# convert dataset from data frame to data table
titanic3 <- copy(TITANIC3)
setDT(titanic3)
The following statement computes the survival probability for each exact value of fare; however, I want to find it based on the probability distribution of the fare column.
titanic3[, survival_prob := round(100 * mean(survived), 1), by = .(fare, sex)]
I have tried converting the fare variable from continuous to categorical and then calculating the probability. The results were somewhat accurate; however, the probabilities change substantially depending on the size of the bins I create when making the categorical variable.
Is there a better way?
Thanks.
Recommended answer
You want to know the conditional probability of survival based on sex and fare. However, fare is a continuous variable, so you cannot simply group by its exact values: many fare values are shared by only a few passengers, so each group mean rests on very little data. In your scenario you need a proper statistical approach.
One approach is logistic regression. First, you estimate a statistical model using logistic regression. Then you extract from the fitted model object (mdl in the code below) the fitted values, which correspond to the conditional probabilities you want. Note, however, that there are other statistical approaches for estimating conditional probabilities, and logistic regression is only one of them. It is widely used for tasks like this one, though.
library(PASWR2)
library(magrittr)
library(data.table)
titanic3 <- copy(TITANIC3)
setDT(titanic3)
# use logistic regression to estimate the conditional probability to survive
# based on fare and sex
mdl <- glm(survived ~ fare + sex, family = binomial(), data = titanic3)
# extract the fitted values, which correspond to the conditional probabilities
mdl$fitted.values
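As a possible follow-up to the code above, the fitted model can also be queried with predict(). This is a minimal sketch, not part of the original answer: the fare values 10, 50 and 100 are arbitrary illustration points, and the object name new_passengers and the column name survival_prob_model are made up here. Using type = "response" returns probabilities rather than log-odds. Passing the full data set as newdata yields one value per row (NA where fare is missing), whereas mdl$fitted.values only covers the rows glm() actually used.

```r
library(PASWR2)

# refit the same model as above so this block is self-contained
mdl <- glm(survived ~ fare + sex, family = binomial(), data = TITANIC3)

# hypothetical passengers: every combination of three arbitrary fares and both sexes
new_passengers <- expand.grid(
  fare = c(10, 50, 100),
  sex  = c("female", "male")
)

# type = "response" gives the estimated probability of survival
new_passengers$survival_prob <- predict(mdl, newdata = new_passengers,
                                        type = "response")
new_passengers

# one prediction per original row; rows with a missing fare get NA
all_probs <- predict(mdl, newdata = TITANIC3, type = "response")
titanic_with_prob <- cbind(TITANIC3, survival_prob_model = all_probs)
head(titanic_with_prob[, c("sex", "fare", "survived", "survival_prob_model")])
```

Because the model treats fare as continuous, the estimated probability changes smoothly with fare and does not depend on any choice of bins.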