Survival probability based on a continuous variable in R | Titanic dataset

Question

The following is the Titanic data set, in which I am trying to find the conditional probability of survival based on sex and fare. Sex is a categorical variable and fare is a continuous variable.

library(PASWR2)
library(magrittr)
library(data.table)

# convert dataset from data frame to data table 
titanic3 <- copy(TITANIC3)
setDT(titanic3)

The following statement computes the survival rate for each exact value of fare within each sex. However, I want to estimate the probability from the distribution of the fare column instead.

titanic3[, survival_prob := round(100 * mean(survived), 1), by = .(fare, sex)]

I have tried converting the fare variable from continuous to categorical and then calculating the probability. The results were somewhat accurate, but the probabilities change substantially depending on the size of the bins I create when making the categorical variable.

Is there a better way to do this?

Thanks.

Recommended answer

You want to know the conditional probability of survival based on sex and fare. However, fare is a continuous variable, so grouping by its exact values (or by arbitrary bins) does not give a stable estimate. In your scenario you need a proper statistical model.

One approach is logistic regression. First, you estimate a statistical model using logistic regression. Then you extract from the object mdl the fitted values, which correspond to the conditional probabilities you want. Note, however, that there are different statistical approaches to estimating conditional probabilities, and logistic regression is only one of them. It is widely used for tasks like this one, though.
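As a sketch of what the model fitted below estimates: the coefficient symbols are notation introduced here for illustration, and the formula assumes sex is a factor whose reference level is female (R's default treatment contrasts with alphabetical levels).

```latex
\[
P(\text{survived} = 1 \mid \text{fare}, \text{sex})
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1\,\text{fare} + \beta_2\,\mathbb{1}[\text{sex} = \text{male}])\bigr)}
\]
```

Because fare enters as a number rather than as a group label, the estimated probability varies smoothly with fare and does not depend on any choice of bins.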

library(PASWR2)
library(magrittr)
library(data.table)

# convert dataset from data frame to data table
titanic3 <- copy(TITANIC3)
setDT(titanic3)

# use logistic regression to estimate the conditional probability of survival
# based on fare and sex
mdl <- glm(survived ~ fare + sex, family = binomial(), data = titanic3)

# extract the fitted values, which correspond to the conditional probabilities
# (glm() drops rows with a missing fare by default, so this vector can be
# shorter than the data)
mdl$fitted.values
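If the goal is a survival_prob column like the one in the question, or probabilities for fares that do not occur in the data, predict() may be more convenient than the raw fitted values. The following is a minimal sketch, not part of the original answer. The fare values in the grid are arbitrary illustration values, and the sex values are assumed to be coded "female" and "male".

```r
library(PASWR2)
library(data.table)

titanic3 <- copy(TITANIC3)
setDT(titanic3)

# same model as above
mdl <- glm(survived ~ fare + sex, family = binomial(), data = titanic3)

# predict() with newdata returns one value per row of newdata;
# rows with a missing fare get NA instead of being dropped
p <- predict(mdl, newdata = titanic3, type = "response")
titanic3[, survival_prob := round(100 * p, 1)]

# estimated probabilities for chosen fare values (arbitrary, for illustration)
grid <- expand.grid(fare = c(10, 50, 100, 250), sex = c("female", "male"))
grid$survival_prob <- predict(mdl, newdata = grid, type = "response")
grid
```

Using type = "response" returns probabilities on the 0 to 1 scale. Without it, predict() returns values on the log-odds scale.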
