Preprocessing: text analysis on many columns from a dataframe

Question

Using the following lines, I can preprocess the text in a specific column of my dataframe:

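library(stringr)  # provides str_trim()
# convert text to lower case
df$name <- tolower(df$name)
# replace all punctuation characters with a space
df$name <- gsub("[[:punct:]]", " ", df$name)
# trim the ends and collapse runs of whitespace into a single space
df$name <- gsub("\\s+", " ", str_trim(df$name))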

I would like to apply these preprocessing rules to all columns (except id) of a dataframe like this:

df  <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), E = c("text","stg","1.2"), F = c("press","remove","22"))

Recommended answer

If you want to do something multiple times, it is often useful to define a function.

For example, you could do the following:

library(stringr)
df  <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), 
                  E = c("text","stg","1.2"), F = c("press","remove","22"))

# create a function so we can apply this multiple times easily.
process <- function(my_vector)
{
  my_vector <- tolower(my_vector)
  #remove all special characters
  my_vector <- gsub("[[:punct:]]", " ", my_vector)
  #remove long spaces
  my_vector <- gsub("\\s+"," ",str_trim(my_vector))
  # return result
  return(my_vector)
}

# for all columns except 'id', apply our function.
for (x in setdiff(colnames(df), "id"))
{
  df[[x]] <- process(df[[x]])
}
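
The loop above can also be written without an explicit for, by handing the selected columns to lapply(). The following is only a sketch under a few assumptions that are not part of the original answer: it uses base R's trimws() in place of stringr's str_trim() so that no package is needed, it sets stringsAsFactors = FALSE so the columns are plain character vectors, and the name text_cols is introduced purely for illustration.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps the columns as character.
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# Same cleaning steps as above, written with base R only
# (assumption: trimws() stands in for stringr::str_trim()).
process <- function(my_vector) {
  my_vector <- tolower(my_vector)
  # replace punctuation characters with a space
  my_vector <- gsub("[[:punct:]]", " ", my_vector)
  # trim the ends, then collapse runs of whitespace into one space
  gsub("\\s+", " ", trimws(my_vector))
}

# Names of every column except 'id' (text_cols is a hypothetical helper name).
text_cols <- setdiff(colnames(df), "id")

# lapply() returns a list with one cleaned vector per column,
# which is assigned back to the same columns in a single step.
df[text_cols] <- lapply(df[text_cols], process)
```

After running this, print(df) shows the cleaned columns; for instance, the value "1.2" in column E becomes "1 2" because the dot is replaced by a space, while the id column is left untouched.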
