How to implement Latent Dirichlet Allocation in regression analysis


Question

I have a dataset of hotel reviews, ratings, and other features such as traveler type and review word count. I want to perform topic modeling (LDA) and use the topics derived from the reviews, together with the other features, to identify the features that most affect the ratings (with rating as the dependent variable).

If I want to use linear regression to do this, does this mean I would have to label each review with the topics derived? Is there a way to do this in R or will I have to manually label each review? (I am new to text mining and data science in general.)

Recommended answer

The short answer: you don't have to label each review with the derived topics yourself. The topic model you train determines the topics of each review, and those topics are then used to construct features for your regression model.

There is a good explanation of topic modeling with code samples (in R) at www.tidytextmining.com/topicmodeling.html. Sections 6.2.1 and 6.2.2 should help you get started quickly.

Keep the following two principles in mind:

  • Every document (hotel review) is a mixture of topics
  • Every topic is a mixture of words

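Once a topic model has been trained on the reviews, then for every review:

  • the document-topic probabilities can be used as features
  • the top N terms of each topic can be used to build a document-term matrix (each review maps to zero or more of the top terms), which can then be used as additional features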
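The two kinds of features above can be sketched in R roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the `dplyr`, `tidytext`, and `topicmodels` packages (plus `tm`, which `cast_dtm` relies on), and the toy reviews, the `has_` column prefix, the choice of 4 topics, and the choice of 3 top terms per topic are all made up for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy reviews; replace with the real review data frame
reviews <- tibble(
  review_id = as.character(1:6),
  text = c(
    "Convenient location, short walk to the train station and shopping",
    "Friendly reception staff, quick check-in, professional service",
    "Clean room with tasteful decor and a comfortable bed",
    "Great pool, fast wifi and a well equipped fitness centre",
    "Location near the train station, friendly staff at reception",
    "Clean room, nice pool, reliable wifi"
  )
)

# One row per (review, word), stop words removed, then cast to a
# document-term matrix with one row per review
review_dtm <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(review_id, word) %>%
  cast_dtm(review_id, word, n)

# Fit a 4-topic model; the seed only makes the run repeatable
review_lda <- LDA(review_dtm, k = 4, control = list(seed = 1234))

# Document-topic probabilities: one row per review, one column per topic
topic_probs <- posterior(review_lda)$topics
colnames(topic_probs) <- paste0("topic_", seq_len(ncol(topic_probs)), "_probability")

# Top 3 terms of each topic, flattened into one vector of unique terms
top_terms <- unique(as.vector(terms(review_lda, 3)))

# Binary flags: does the review contain each top term?
term_flags <- as.matrix(review_dtm)[, top_terms, drop = FALSE] > 0
colnames(term_flags) <- paste0("has_", top_terms)

# One feature row per review, keyed by review_id
features <- data.frame(
  review_id = rownames(topic_probs),
  topic_probs,
  term_flags * 1,
  row.names = NULL
)
```

On a corpus this small the topics will not be meaningful; the point is only the shape of the output. Topic numbers carry no labels of their own, so names such as "location" or "hotel staff" have to be assigned by reading the top terms of each topic.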

A simplified example: there might be 4 topics that the reviews broadly fall under.

  • Topic 1 might be about location (top terms: convenient, location, train_station, walk_distance, shopping, etc.)
  • Topic 2 might be about hotel staff (top terms: reception, friendly, professional, quick, late_checkout, etc.)
  • Topic 3 might be about hotel rooms (top terms: clean_room, decor, tasteful, etc.)
  • Topic 4 might be about hotel amenities (top terms: pool, wifi, fitness_centre, etc.)

The document-topic probabilities, combined with the top terms of each topic, can be used as features similar to:

  • topic_1_location_probability
  • topic_2_hotel_staff_probability
  • topic_3_hotel_room_probability
  • topic_4_hotel_amenities_probability
  • is_convenient_location
  • is_train_station_nearby
  • is_walk_distance
  • is_clean
  • is_late_checkout
  • is_fitness_centre
  • etc.
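Once such a feature table exists, the regression itself is ordinary `lm()`. The sketch below uses only base R and entirely synthetic placeholder data (random topic probabilities, flags, and ratings), so its coefficients mean nothing; it is only meant to show the shape of the model. The `word_count` and `traveler_type` columns are assumed names for the other features mentioned in the question. One detail worth noting: the topic probabilities of a review sum to 1, so together with an intercept they are perfectly collinear, and one topic column is left out as the reference.

```r
set.seed(42)
n <- 200

# Placeholder document-topic probabilities: each row sums to 1,
# as it would for the output of a fitted topic model
raw <- matrix(rgamma(n * 4, shape = 1), ncol = 4)
topic_probs <- raw / rowSums(raw)
colnames(topic_probs) <- c(
  "topic_1_location_probability",
  "topic_2_hotel_staff_probability",
  "topic_3_hotel_room_probability",
  "topic_4_hotel_amenities_probability"
)

# Placeholder top-term flags and the other review features (assumed names)
features <- data.frame(
  topic_probs,
  is_clean = rbinom(n, 1, 0.4),
  is_late_checkout = rbinom(n, 1, 0.1),
  word_count = rpois(n, 80),
  traveler_type = factor(sample(c("business", "couple", "family"), n, replace = TRUE))
)

# Synthetic ratings, generated only so that lm() has something to fit
features$rating <- 3 +
  1.5 * features$topic_3_hotel_room_probability +
  0.5 * features$is_clean +
  rnorm(n, sd = 0.5)

# Regress rating on all features, leaving out one topic column as the
# reference category to avoid perfect collinearity with the intercept
fit <- lm(rating ~ . - topic_4_hotel_amenities_probability, data = features)
summary(fit)
```

Each remaining topic coefficient is then read relative to the omitted topic. Because the features are on different scales (probabilities, 0/1 flags, word counts), raw coefficients are not directly comparable when judging which feature "most affects" the rating.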

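For new reviews:

  • The example above shows how to create the initial training data set, based on the trained model.
  • For newer reviews (i.e. reviews not previously used to train the model), you don't have to repeat the whole exercise above. Instead, the trained topic model can be used to identify the topics of documents (reviews) it has not seen before.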
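Scoring unseen reviews with an already trained model might look like the sketch below. It assumes the `topicmodels` package, whose `posterior()` function accepts a `newdata` document-term matrix; the helper `to_dtm`, the toy reviews, and the choice of 2 topics are hypothetical and only keep the example short. Words the model never saw during training are dropped before scoring, since the model has no probabilities for them.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical helper: tokenize reviews and cast them to a document-term
# matrix. When `vocab` is given, words outside that vocabulary are dropped.
to_dtm <- function(reviews, vocab = NULL) {
  tokens <- reviews %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word")
  if (!is.null(vocab)) {
    tokens <- dplyr::filter(tokens, word %in% vocab)
  }
  tokens %>%
    count(review_id, word) %>%
    cast_dtm(review_id, word, n)
}

# Toy training reviews (made up for illustration)
train <- tibble(
  review_id = paste0("train_", 1:4),
  text = c(
    "Convenient location, short walk to the train station",
    "Friendly staff at reception, professional service",
    "Clean room with tasteful decor",
    "Train station nearby, friendly staff, clean room"
  )
)

# Train the topic model once
review_lda <- LDA(to_dtm(train), k = 2, control = list(seed = 1234))

# Reviews that were not part of training
new_reviews <- tibble(
  review_id = c("new_1", "new_2"),
  text = c(
    "Friendly staff and a clean room",
    "Short walk from the train station"
  )
)

# Restrict the new reviews to the trained vocabulary, then infer their
# topic probabilities without re-estimating the topics themselves
new_dtm <- to_dtm(new_reviews, vocab = review_lda@terms)
new_topic_probs <- posterior(review_lda, newdata = new_dtm)$topics
new_topic_probs
```

The resulting matrix has one row per new review and one column per topic, so it can be turned into the same `topic_k_probability` features and passed to the regression model's `predict()`. A new review that shares no words with the training vocabulary drops out of `new_dtm` entirely and would need separate handling.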

Hope this helps.
