Creating Vector by Word Occurrence Table

Problem description

I need to find an automated way to take my original vector and transform each word, regardless of position in the vector, into a new vector. Each new vector reflects the presence of its basis word in each element of the original vector.

I need to turn this:

OriginalVector <- c("Nimble red fox", "Lazy Grey Dog", "Red Fox funny")

into this:

Nimble Red    Fox    Lazy   Grey   Dog    Funny
1      1      1      0      0      0      0
0      0      0      1      1      1      0
0      1      1      0      0      0      1

Each row should correspond to one element of the original vector. That is, the 1s in row one mark the words present in the first element, "Nimble red fox"; row two marks the words occurring in "Lazy Grey Dog"; and so on.

My real world problem has 300,000 more elements with several hundred thousand unique words. I could use grep() or grepl(), but trying to build each vector individually would be mind-boggling. Is there an automated way to solve this problem?

Note: I am not looking for a word co-occurrence matrix. Instead, I need a frequency table with one row per original vector element and one column per word.

Solution

Any of the natural language processing frameworks can do this fairly easily. I like tidytext for simple things like this. There are computationally faster options, but this one is straightforward.

library(tidytext)
library(dplyr)
library(tidyr)

OriginalVector <- c("Nimble red fox", "Lazy Grey Dog", "Red Fox funny")

df <- tibble(id = seq_along(OriginalVector), text = OriginalVector)

df %>%
  unnest_tokens(word, text) %>%
  count(id, word) %>%
  pivot_wider(id_cols = id, names_from = word, values_from = n, values_fill = list(n = 0)) %>%
  select(-id)

# A tibble: 3 x 7
    fox nimble   red   dog  grey  lazy funny
  <int>  <int> <int> <int> <int> <int> <int>
1     1      1     1     0     0     0     0
2     0      0     0     1     1     1     0
3     1      0     1     0     0     0     1
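
At the scale mentioned in the question (hundreds of thousands of elements and of unique words), a wide tibble would be almost entirely zeros and may not fit in memory. One possible variation, sketched here on the assumption that a sparse matrix is an acceptable result, swaps pivot_wider() for tidytext's cast_sparse(), which returns a sparse matrix from the Matrix package. The name m is only illustrative.

```r
library(tidytext)
library(dplyr)

OriginalVector <- c("Nimble red fox", "Lazy Grey Dog", "Red Fox funny")
df <- tibble(id = seq_along(OriginalVector), text = OriginalVector)

# One row per (id, word) pair with its count, then cast to a sparse matrix:
# rows are elements of the original vector, columns are words.
m <- df %>%
  unnest_tokens(word, text) %>%
  count(id, word) %>%
  cast_sparse(id, word, n)

dim(m)         # number of elements x number of distinct words
m["3", "red"]  # count of "red" in the third element
```

Only the non-zero cells are stored, so memory use should grow with the number of (element, word) pairs that actually occur rather than with rows times columns.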

You can also exit the framework early and just use table.

table(unnest_tokens(df, word, text))

   word
id  dog fox funny grey lazy nimble red
  1   0   1     0    0    0      1   1
  2   1   0     0    1    1      0   0
  3   0   1     1    0    0      0   1
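
Both results above hold counts, so a word that appears twice in one element would show up as 2 rather than 1. If only presence or absence is wanted, as in the 0/1 layout of the question, the counts can be capped. A minimal sketch, using a hypothetical input (RepeatedVector, not from the question) in which "red" is repeated:

```r
library(tidytext)
library(dplyr)

# Hypothetical input: "red" occurs twice in the first element.
RepeatedVector <- c("Red red fox", "Lazy Grey Dog")
df2 <- tibble(id = seq_along(RepeatedVector), text = RepeatedVector)

counts <- table(unnest_tokens(df2, word, text))

# Turn counts into 0/1 presence flags; the id and word labels are kept.
presence <- (counts > 0) * 1L

counts["1", "red"]    # how often "red" occurs in element 1
presence["1", "red"]  # whether "red" occurs in element 1
```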

Note that unnest_tokens() has an option to_lower = TRUE by default. You can change this to FALSE if you do not want that.
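
To see what that option changes, here is a small sketch of the same table built with to_lower = FALSE. The names tokens_cased and tab_cased are only illustrative.

```r
library(tidytext)
library(dplyr)

OriginalVector <- c("Nimble red fox", "Lazy Grey Dog", "Red Fox funny")
df <- tibble(id = seq_along(OriginalVector), text = OriginalVector)

# Keep the original capitalisation: "red" and "Red" are now different tokens.
tokens_cased <- unnest_tokens(df, word, text, to_lower = FALSE)
tab_cased <- table(tokens_cased)

colnames(tab_cased)
```

With this input, "red"/"Red" and "fox"/"Fox" end up in separate columns, which is probably not what the question wants, so the default of lower-casing looks like the better fit here.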
