How can I model the effect of genre on movie ratings?

Question

I'm doing a machine learning exercise in R using a larger version of the MovieLens dataset (10 million rows), where my task is to predict ratings in the validation set using the data in the training set. Currently my model is as follows:

Rating by user u for movie i = mu + b_i + b_u + epsilon, where mu is the mean rating, b_i is the effect of each movie, b_u is the effect of each user. Epsilon is supposed to be the random error term, but right now it also contains the effect of genres which I haven't accounted for.
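
For reference, a model of this form can be fitted with successive group means. The following is a minimal sketch, not necessarily how the asker computed the effects; the toy data frame and the column names `userId`, `movieId`, and `rating` are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the training set (hypothetical values and column names)
train <- data.frame(
  userId  = c(1, 1, 2, 2, 3, 3),
  movieId = c(10, 20, 10, 30, 20, 30),
  rating  = c(4, 3, 5, 2, 4, 1)
)

# mu: overall mean rating
mu <- mean(train$rating)

# b_i: average deviation of each movie's ratings from mu
movie_effects <- train %>%
  group_by(movieId) %>%
  summarise(b_i = mean(rating - mu), .groups = "drop")

# b_u: average of what is left for each user after removing mu and b_i
user_effects <- train %>%
  left_join(movie_effects, by = "movieId") %>%
  group_by(userId) %>%
  summarise(b_u = mean(rating - mu - b_i), .groups = "drop")

# resid: the part of each rating not explained by mu, b_i, and b_u
train_resid <- train %>%
  left_join(movie_effects, by = "movieId") %>%
  left_join(user_effects, by = "userId") %>%
  mutate(resid = rating - mu - b_i - b_u)
```

Any genre effect that exists would still be sitting in the `resid` column at this point, which is why it is the natural target for the next modelling step.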

Here's a screenshot of my current dataset for reference - note that the resid column contains the residual rating after subtracting mu, b_i, and b_u.

I'm stuck because I have no idea how to model the effect of genres. Does anyone have any tips on how I can proceed?

Recommended answer

Main idea: convert each value in the "Genre" field into its own field (Comedy, Romance, and so on) holding an indicator value (Y/N or 0/1).

The sample data below illustrates the approach; the same steps should carry over to your own data.

library(tibble)  # tribble()
library(tidyr)   # separate(), gather()
library(dplyr)   # %>%, select()

sample <- tribble(~ Values,
                  "apple|banana",
                  "orange|apple",
                  "banana|guava")
sample

Steps:

  1. Split the delimited values in the field into separate columns, using tidyr's separate function.

sample %>% separate(Values, into = c("val1","val2"), sep = "\\|") -> sample2
sample2

  2. Gather all the individual values into a single column, using tidyr's gather function.

    sample2 %>% gather(key = "col_name", value = "col_val", val1, val2) ->sample3
    sample3
    

  3. Finally, one-hot encode the "col_val" field to get the desired output.

    sample4 <- sample3 %>% select(2)
    sample4
    as.data.frame(model.matrix( ~ . -1, sample4))
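
Note that the one-hot matrix above has one row per gathered value rather than one row per original record, and that `separate` with two target columns assumes exactly two values per field. A possible variation that keeps one row per record and copes with any number of genres is sketched below. The data frame, its values, and the column names `movieId`, `genres`, and `resid` are hypothetical, and regressing the residual on the indicators with `lm` is only one of several ways the genre effect could be estimated.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows: a pipe-delimited genre string plus the residual
# left over after removing mu, b_i, and b_u
ratings <- data.frame(
  movieId = c(1, 2, 3, 4, 5, 6),
  genres  = c("Comedy|Romance", "Action", "Comedy",
              "Action|Romance", "Comedy", "Action"),
  resid   = c(0.5, -0.3, 0.2, -0.1, 0.4, -0.2)
)

genre_wide <- ratings %>%
  mutate(row_id = row_number()) %>%          # remember which record each genre came from
  separate_rows(genres, sep = "\\|") %>%     # one row per (record, genre) pair
  mutate(flag = 1L) %>%
  pivot_wider(names_from = genres,           # one 0/1 column per genre
              values_from = flag,
              values_fill = list(flag = 0L))

# One possible genre model: linear regression of the residual on the indicators
fit <- lm(resid ~ Comedy + Romance + Action, data = genre_wide)
coef(fit)
```

On the full dataset the set of genre columns would come from the data itself, so the formula would need to be built from the column names rather than typed by hand.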
    

    Let me know if it helped you.

    Happy learning!
