Preprocessing: text analysis on many columns from a dataframe
Problem description
Using the following lines, I can preprocess the text in a specific column of my dataframe:
library(stringr)  # provides str_trim(), used below
# text to lower case
df$name <- tolower(df$name)
#remove all special characters
df$name <- gsub("[[:punct:]]", " ", df$name)
#remove long spaces
df$name <- gsub("\\s+"," ",str_trim(df$name))
I would like to apply these preprocessing rules to all columns (except id) of a dataframe like this:
df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), E = c("text","stg","1.2"), F = c("press","remove","22"))
Recommended answer
If you want to do something multiple times, it is often useful to define a function.
For example, you could do the following:
library(stringr)
df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"),
                 E = c("text","stg","1.2"), F = c("press","remove","22"))
# create a function so we can apply this multiple times easily.
process <- function(my_vector)
{
  # text to lower case
  my_vector <- tolower(my_vector)
  # remove all special characters
  my_vector <- gsub("[[:punct:]]", " ", my_vector)
  # remove long spaces
  my_vector <- gsub("\\s+", " ", str_trim(my_vector))
  # return result
  return(my_vector)
}
# for all columns except 'id', apply our function.
for (x in setdiff(colnames(df), "id"))
{
  df[[x]] <- process(df[[x]])
}
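The loop above can also be written without an explicit for, by applying the function to the selected columns with lapply. The following is only a minimal sketch under a few assumptions that are not part of the original answer: it uses base R's trimws() in place of stringr's str_trim(), so no package is needed, and it passes stringsAsFactors = FALSE so the text columns are plain character vectors regardless of R version. The variable name cols is introduced here purely for illustration.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps the
# text columns as character vectors on any R version (an assumption
# added for this sketch).
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# Same three steps as above, with base R's trimws() swapped in
# for stringr::str_trim().
process <- function(my_vector) {
  my_vector <- tolower(my_vector)                   # text to lower case
  my_vector <- gsub("[[:punct:]]", " ", my_vector)  # punctuation -> space
  gsub("\\s+", " ", trimws(my_vector))              # trim, then collapse whitespace
}

# Every column except 'id' (cols is a hypothetical helper name).
cols <- setdiff(colnames(df), "id")

# lapply() returns one processed vector per column; assigning the list
# back to df[cols] replaces those columns and leaves 'id' untouched.
df[cols] <- lapply(df[cols], process)
```

With the sample data, column D becomes "mytext 11", "mytext" and "" (the "!!" entry consists only of punctuation, so nothing is left after trimming), and "1.2" in column E becomes "1 2". If empty strings or split numbers are not wanted, the rules inside process would need adjusting before applying them to every column.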