Using DocumentTermMatrix on a Vector of First and Last Names

Views: 161  Published: 2020/5/18 1:09:45  Tags: r nlp tm


Problem description

I have a column in my data frame (df) as follows:

> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"     
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"                
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

The column contains more than 4,000 unique names (first names, last names, and nicknames), and each row holds a comma-separated list of full names, as shown above. I would like to create a DocumentTermMatrix for this column in which terms are matched as full names and only the most frequently occurring names are used as columns. I tried the following code:

> people_list = strsplit(people, ", ")

> corp = Corpus(VectorSource(people_list))

> dtm = DocumentTermMatrix(corp, people_dict)

where people_dict is a list of the most commonly occurring people in people_list (about 150 full names), as follows:

> people_dict[1:3]
[[1]]
[1] "Christian Slater"

[[2]]
[1] "Tara Reid"

[[3]]
[1] "Stephen Dorff"

However, DocumentTermMatrix does not seem to use people_dict at all, because the result has far more columns than people_dict has entries. I also think DocumentTermMatrix is splitting each name string into separate words. For example, "Danny Devito" becomes one column for "Danny" and another for "Devito".

> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity           : 100%
Maximal term length: 9
Weighting          : term frequency (tf)

    Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
   1   0      0     0        0         0       0        0    0    0       0
   2   0      0     0        0         0       0        0    0    0       0
   3   0      0     0        0         0       0        0    0    0       0
   4   0      0     0        0         0       0        0    0    0       0
   5   0      0     0        0         0       0        0    0    0       0

I have read through all the tm documentation I could find and spent hours searching Stack Overflow for a solution. Please help!

Recommended answer

The default tokenizer splits text into individual words. You need to provide a custom tokenizer function:

# Split a document into tokens at ", " so each full name stays one term
commasplit_tokenizer <- function(x)
  unlist(strsplit(as.character(x), ", "))
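To see what this tokenizer hands to DocumentTermMatrix, it can be called directly on one of the rows from the question. The sketch below uses only base R; the variable name tokens is just for illustration.

```r
# The tokenizer from the answer, repeated so the example is self-contained
commasplit_tokenizer <- function(x)
  unlist(strsplit(as.character(x), ", "))

# Each full name comes back as a single token instead of separate words
tokens <- commasplit_tokenizer("Ice Cube, Nia Long, Aleisha Allen, Philip Bolden")
print(tokens)
```

Because the split is on the literal ", " separator, names with internal spaces or periods (such as "Frank C. Turner") stay in one piece.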

Note that you should not split the actors apart before creating the corpus. Pass the original comma-separated strings and let the tokenizer do the splitting.

people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"     
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"                
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
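The question asks for only the most frequent names as columns, so on the real data the dictionary would be derived rather than typed by hand. The following is a minimal base R sketch of one way to do that; the fourth row, top_n, and people_dict_auto are hypothetical and exist only so that some names occur more than once.

```r
people_example <- c(
  "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
  "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
  "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer",
  "Uma Thurman, Nia Long, Stephen Dorff"  # hypothetical extra row
)

# Count how often each full name occurs across all rows
all_names <- unlist(strsplit(people_example, ", "))
name_counts <- sort(table(all_names), decreasing = TRUE)

# Keep the top_n most frequent names as a plain character vector
top_n <- 3
people_dict_auto <- names(name_counts)[seq_len(min(top_n, length(name_counts)))]
print(people_dict_auto)
```

The result is a character vector like the hand-written people_dict in the answer, not a list of one-element vectors as in the question.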

The control options did not work with plain Corpus, so I used VCorpus:

library(tm)
corp <- VCorpus(VectorSource(people))
# tolower = FALSE keeps the capitalized names so they match the dictionary
dtm <- DocumentTermMatrix(corp, control = list(tokenize = commasplit_tokenizer,
                                               dictionary = people_dict,
                                               tolower = FALSE))

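All of the options are passed within control, including:

tokenize: the custom tokenizer function
dictionary: the names to keep as columns
tolower = FALSE: so the capitalized dictionary entries still match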

Results:

as.matrix(dtm)
    Terms
Docs Nia Long Stephen Dorff Uma Thurman
   1        0             1           0
   2        1             0           0
   3        0             0           1
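As a sanity check that does not depend on tm, the same document-by-name layout can be built in base R from the three sample rows. This is only a sketch: it records presence (0 or 1) rather than term frequency, which is equivalent here because no name repeats within a row, and the names name_lists and check are illustrative.

```r
people <- c(
  "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
  "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
  "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
)
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")

# One character vector of full names per row
name_lists <- strsplit(people, ", ")

# Rows are documents, columns are dictionary names; 1 means the name appears
check <- t(sapply(name_lists, function(doc_names) as.integer(people_dict %in% doc_names)))
dimnames(check) <- list(Docs = seq_along(people), Terms = people_dict)
print(check)
```

Each dictionary name should appear in exactly one of the three rows, which gives a quick way to spot a misspelled dictionary entry.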

I hope this helps.
