Cleaning text (removing stop words, punctuation, etc.) from an Excel file in R


Question

For my master's thesis I am analyzing courses at a university. I have 1134 courses (as rows) with 3 variables (as columns). Because I have little experience with R, I am struggling to write the code for it. More information follows below, and I have attached a sample of the database as an image.

Column 1 is the course name, column 2 is the course description, and column 3 is the learning outcomes.

I want to clean the data and remove stop words, punctuation and other irrelevant characters. I do this with the following code:

rm(list=ls());
library(readxl);
library(MASS);
library(nnet);
library(NLP);
library(tm);
database <- read_excel("/Volumes/GoogleDrive/My Drive/TU e Innovation Management /Thesis/testdatabasematrix.xlsx");

#name columns
colnames(database)[1] <- "Name";
colnames(database)[2] <- "Description";
colnames(database)[3] <- "LearningOutcomes";

#replace punctuation
database2 <- gsub(pattern = "\\W", replace = " ", database)
#replace digits
database2 <- gsub(pattern="\\d", " ", database2)
#everything to lower
database2 <- tolower(database2)

#until here everything fine
database2 <- removeWords(database2, stopwords());

#When I try to save the database in a data frame, the output is merely 3 observations of 1 variable instead of 1141 obs. of 3 variables
database2 <- data.frame(database2)

I hope you can help me :). If you need more information, please say so and I will of course provide it.

Best, Christian

Recommended answer

You may also consider the tidytext and dplyr packages, which are definitely nice:

# some data similar to yours
database <- data.frame(Name = c('Aalto Fellows II', 'Aalto introduction to Services'),
                       Description = c('This course is a lot of words I do not know.','Service economy, whatever it does mean.'),
                       LearningOutcomes = c('Aalto Fellows, which are the smartest, learn.','Knowing what does Service economy means.'), stringsAsFactors = FALSE)

# cool packages
library(tidytext)
library(dplyr)

# text transformations for the course names
title <- tibble(line = 1:nrow(database), text = database$Name) %>%        # one row per course
         unnest_tokens(word, text) %>%                                    # strip punctuation, lowercase, one word per row
         anti_join(stop_words, by = "word") %>%                           # remove stop words
         group_by(line) %>% summarise(title = paste(word, collapse = ' '))  # back to one row per course

# text transformations for the descriptions
description <- tibble(line = 1:nrow(database), text = database$Description) %>%
               unnest_tokens(word, text) %>%
               anti_join(stop_words, by = "word") %>%
               group_by(line) %>% summarise(description = paste(word, collapse = ' '))

# text transformations for the learning outcomes
learningoutcomes <- tibble(line = 1:nrow(database), text = database$LearningOutcomes) %>%
                    unnest_tokens(word, text) %>%
                    anti_join(stop_words, by = "word") %>%
                    group_by(line) %>% summarise(learningoutcomes = paste(word, collapse = ' '))

# now the full dataset
database2 <- title %>% left_join(description, by = 'line') %>% left_join(learningoutcomes, by = 'line')
colnames(database2) <- c("line","Name","Description","LearningOutcomes")
database2

# A tibble: 2 x 4
   line Name                        Description     LearningOutcomes             
  <int> <chr>                       <chr>           <chr>                        
1     1 aalto fellows ii            lot words       aalto fellows smartest learn 
2     2 aalto introduction services service economy knowing service economy means
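The three pipelines above differ only in the column they read, so they could be folded into one function. The following is a minimal sketch of that idea, not part of the original answer: the helper name clean_text and the result name cleaned are hypothetical, across() assumes dplyr 1.0.0 or later, and turning rows that contain only stop words into empty strings (instead of dropping them) is a design choice made here.

```r
library(dplyr)
library(tidytext)

# Sample data, same as in the answer
database <- data.frame(
  Name = c('Aalto Fellows II', 'Aalto introduction to Services'),
  Description = c('This course is a lot of words I do not know.',
                  'Service economy, whatever it does mean.'),
  LearningOutcomes = c('Aalto Fellows, which are the smartest, learn.',
                       'Knowing what does Service economy means.'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cleans ONE character vector and returns a
# character vector of the same length
clean_text <- function(text) {
  tibble(line = seq_along(text), text = text) %>%
    unnest_tokens(word, text) %>%                       # lowercase, strip punctuation, one word per row
    anti_join(stop_words, by = "word") %>%              # remove stop words
    group_by(line) %>%
    summarise(clean = paste(word, collapse = " ")) %>%  # back to one string per row
    right_join(tibble(line = seq_along(text)), by = "line") %>%  # restore rows that lost every word
    arrange(line) %>%
    mutate(clean = coalesce(clean, "")) %>%             # such rows become empty strings
    pull(clean)
}

# Apply the helper to all three text columns at once
cleaned <- database %>%
  mutate(across(c(Name, Description, LearningOutcomes), clean_text))
cleaned
```

Because the helper returns one string per input row, the row count stays the same as in the input even when a cell consists entirely of stop words, which the join-based version above would silently drop.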

You can convert the database2 tibble to a data.frame with data.frame().
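For comparison, the approach from the question can also be kept in base R. gsub() and tolower() expect a character vector; when they are handed a whole data frame they coerce it with as.character(), which yields one long string per column, and that is why only 3 observations of 1 variable came back. A minimal sketch, assuming the cleaning is applied column by column with lapply(): the helper clean_column, the result name cleaned_base and the short stop-word vector are illustrative stand-ins, and in the question's setup tm's removeWords(x, stopwords()) could take the place of the stop-word step.

```r
# Sample data shaped like the question's (values taken from the answer)
database <- data.frame(
  Name = c("Aalto Fellows II", "Aalto introduction to Services"),
  Description = c("This course is a lot of words I do not know.",
                  "Service economy, whatever it does mean."),
  LearningOutcomes = c("Aalto Fellows, which are the smartest, learn.",
                       "Knowing what does Service economy means."),
  stringsAsFactors = FALSE
)

# Illustrative stop-word list (an assumption; a real list is much longer)
my_stopwords <- c("a", "the", "is", "to", "of", "i", "do", "not",
                  "it", "which", "are", "what", "does", "this")

# Hypothetical helper that cleans ONE character vector
clean_column <- function(x) {
  x <- tolower(x)                               # everything to lower case
  x <- gsub("\\W", " ", x, perl = TRUE)         # replace punctuation
  x <- gsub("\\d", " ", x, perl = TRUE)         # replace digits
  stop_pattern <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, " ", x, perl = TRUE)  # remove whole-word stop words
  x <- gsub("\\s+", " ", x, perl = TRUE)        # collapse repeated spaces
  trimws(x)
}

# Assigning into database[] keeps the data frame shape (rows x 3 columns)
cleaned_base <- database
cleaned_base[] <- lapply(database, clean_column)
str(cleaned_base)
```

The key point is that each column is cleaned as its own character vector, so the result still has one row per course and three variables.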
