How to implement Latent Dirichlet Allocation in regression analysis
Question
I have a dataset of hotel reviews, ratings, and other features such as traveler type and review word count. I want to perform topic modeling (LDA) and use the topics derived from the reviews, together with the other features, to identify which features most affect the ratings (with rating as the dependent variable).
If I want to use linear regression for this, does that mean I have to label each review with the derived topics? Is there a way to do this in R, or will I have to label each review manually? (I am new to text mining and data science in general.)
Recommended answer
The short answer: you don't have to label each review by hand. The topic model you train determines the topics of each review, and those topics are then used to construct features for your regression model.
There is a good explanation of topic modeling with code samples (in R) at www.tidytextmining.com/topicmodeling.html. Sections 6.2.1 and 6.2.2 should help you get started quickly.
Keep the following two principles in mind:
- Every document (hotel review) is a mixture of topics
- Every topic is a mixture of words
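These two principles can be written compactly. The notation below is my own shorthand, not taken from the original answer: $\theta_{d,k}$ is the probability of topic $k$ in document $d$, and $\beta_{k,w}$ is the probability of word $w$ in topic $k$, for $K$ topics in total.

$$
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\beta_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1 .
$$

In the tidytext tidiers for topicmodels, the document-topic probabilities $\theta$ are reported as "gamma" and the topic-word probabilities as "beta".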
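Once a topic model has been trained on the reviews, then for every review:
- The document-topic probabilities can be used as features
- The top N terms of each topic can be used to build a document-term matrix (each review maps to zero or more of the top terms), which can then be used as additional features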
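A minimal sketch of the first point, assuming the tidytext and topicmodels packages used in the linked book. The reviews, the column names (`review_id`, `text`), and the choice of `k = 4` are made up for illustration and are not from the original answer.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(topicmodels)  # cast_dtm() below also needs the tm package installed

# Toy reviews, invented for this sketch
reviews <- tibble(
  review_id = 1:8,
  text = c(
    "convenient location near the train station and shopping",
    "friendly reception staff, quick and professional check in",
    "clean room with tasteful decor",
    "great pool, fast wifi and a good fitness centre",
    "location is walking distance from the station, staff friendly",
    "room was clean, pool and wifi were fine",
    "professional staff at reception, late checkout was quick",
    "shopping and train station close by, convenient location"
  )
)

# One row per (review, word) with stop words removed, then cast to a
# DocumentTermMatrix, which is the input format LDA() expects
dtm <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(review_id, word) %>%
  cast_dtm(review_id, word, n)

# k = 4 is an assumption; in practice the number of topics has to be chosen
lda_model <- LDA(dtm, k = 4, control = list(seed = 1234))

# "gamma" holds the per-document topic probabilities;
# reshape to one row per review and one column per topic
topic_features <- tidy(lda_model, matrix = "gamma") %>%
  mutate(review_id = as.integer(document)) %>%
  select(-document) %>%
  pivot_wider(names_from = topic, values_from = gamma,
              names_prefix = "topic_")

print(topic_features)
```

Each row of `topic_features` is one review, and the `topic_*` columns in that row sum to 1. The table can be joined back to the ratings and other features by `review_id`.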
A simplified example: there might be 4 topics that the reviews broadly fall under.
- Topic 1 might be about location (top terms: convenient, location, train_station, walk_distance, shopping, etc.)
- Topic 2 might be about hotel staff (top terms: reception, friendly, professional, quick, late_checkout, etc.)
- Topic 3 might be about hotel rooms (top terms: clean_room, decor, tasteful, etc.)
- Topic 4 might be about hotel amenities (top terms: pool, wifi, fitness_centre, etc.)
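The model itself only returns numbered topics. Labels such as "location" or "hotel staff" come from a person reading each topic's top terms. The following sketch shows one way to list those terms, again assuming tidytext and topicmodels; the reviews and `k = 4` are invented for illustration, and with so little text the topics will not be as clean as in the example above.

```r
library(dplyr)
library(tidytext)
library(topicmodels)  # cast_dtm() below also needs the tm package installed

# Toy reviews, invented for this sketch
reviews <- tibble(
  review_id = 1:6,
  text = c(
    "convenient location near the train station and shopping",
    "friendly reception staff, quick and professional",
    "clean room with tasteful decor",
    "great pool, fast wifi and a good fitness centre",
    "location is walking distance from the station, staff friendly",
    "room was clean, pool and wifi were fine"
  )
)

dtm <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(review_id, word) %>%
  cast_dtm(review_id, word, n)

lda_model <- LDA(dtm, k = 4, control = list(seed = 1234))

# "beta" holds the per-topic word probabilities;
# keep the 3 most probable terms in each topic
top_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

print(top_terms)
```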
The document-topic probabilities, combined with the top terms of each topic, can be used as features similar to:
- topic_1_location_probability
- topic_2_hotel_staff_probability
- topic_3_hotel_room_probability
- topic_4_hotel_amenities_probability
- is_convenient_location
- is_train_station_nearby
- is_walk_distance
- is_clean
- is_late_checkout
- is_fitness_centre
- etc.
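A feature table like this can go straight into `lm()`. The sketch below uses base R only and simulated values as a stand-in for real model output: the topic probabilities, `is_clean`, `word_count`, `traveler_type`, and `rating` are all generated at random purely so that the regression has something to fit. One detail is worth noting: the topic probabilities of a review sum to 1, so one topic column is left out of the formula to avoid perfect collinearity with the intercept.

```r
set.seed(42)
n <- 200

# Simulated stand-in for document-topic probabilities (each row sums to 1)
raw <- matrix(rgamma(n * 4, shape = 0.5), ncol = 4)
gamma <- raw / rowSums(raw)
colnames(gamma) <- c(
  "topic_1_location_probability",
  "topic_2_hotel_staff_probability",
  "topic_3_hotel_room_probability",
  "topic_4_hotel_amenities_probability"
)

# Simulated stand-ins for a top-term indicator and the non-text features
features <- data.frame(
  gamma,
  is_clean = rbinom(n, 1, 0.3),
  word_count = rpois(n, 80),
  traveler_type = factor(sample(c("business", "couple", "family", "solo"),
                                n, replace = TRUE))
)

# Simulated rating, only so that lm() has a response to fit
features$rating <- with(
  features,
  3 + 1.0 * topic_3_hotel_room_probability + 0.5 * is_clean +
    rnorm(n, sd = 0.5)
)

# topic_4 is omitted on purpose: it is fully determined by the other three
fit <- lm(
  rating ~ topic_1_location_probability +
    topic_2_hotel_staff_probability +
    topic_3_hotel_room_probability +
    is_clean + word_count + traveler_type,
  data = features
)

print(summary(fit))
```

With real data, the coefficients on the topic columns are read relative to the omitted topic, in the same way that the `traveler_type` coefficients are read relative to the baseline level.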
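Handling new reviews:
- The example above shows how to create the initial training dataset from the trained model.
- For newer reviews (that is, reviews not previously used to train the model), you don't have to repeat the whole exercise above. Instead, the trained topic model can be used to identify the topics of previously unseen documents (reviews).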
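In the topicmodels package, one way to do this is `posterior()` with a `newdata` argument, which infers topic probabilities for new documents from an already fitted model. The sketch below assumes the same tidytext workflow as the linked book; the reviews, the helper `to_counts()`, and `k = 2` are hypothetical and exist only for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)  # cast_dtm() below also needs the tm package installed

# Hypothetical helper: tokenize, drop stop words, count words per review
to_counts <- function(df) {
  df %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(review_id, word)
}

# Toy training reviews, invented for this sketch
train_reviews <- tibble(
  review_id = 1:6,
  text = c(
    "convenient location near the train station and shopping",
    "friendly reception staff, quick and professional",
    "clean room with tasteful decor",
    "great pool, fast wifi and a good fitness centre",
    "location is walking distance from the station, staff friendly",
    "room was clean, pool and wifi were fine"
  )
)

train_counts <- to_counts(train_reviews)
train_dtm <- cast_dtm(train_counts, review_id, word, n)
lda_model <- LDA(train_dtm, k = 2, control = list(seed = 1234))

# Reviews the model has not seen
new_reviews <- tibble(
  review_id = 101:102,
  text = c("clean room close to the train station",
           "friendly staff and fast wifi")
)

# Keep only words the model saw during training, then cast to a DTM
new_dtm <- to_counts(new_reviews) %>%
  filter(word %in% unique(train_counts$word)) %>%
  cast_dtm(review_id, word, n)

# Document-topic probabilities for the new reviews, one row per review
new_topics <- posterior(lda_model, newdata = new_dtm)$topics

print(new_topics)
```

A new review that shares no words with the training vocabulary drops out at the `filter()` step, so it would need separate handling.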
Hope this helps.