Recognize patterns in a column and add them to a column in a data frame


Question

I have a column with 50 keywords:

Keyword1 
Keyword2
Keyword3
KeywordN=50

In addition, I have a data frame with two columns: Title and Abstract.

Title                    Abstract
Rstudio Keyword1         A interesting program language keyword2
Python Keyword2          A interesting program keyword3 language

I want an extra column (let's call it Keywords) that lists each keyword found in either the Title or the Abstract, like this:

Title                    Abstract                                   Keywords
Rstudio Keyword1         A interesting program language keyword2    Keyword1, keyword2
Python Keyword2          A interesting program keyword3 language    Keyword2, Keyword3

The only way I could 'solve' this was by creating binary columns with the grepl function (one per keyword, flagging whether the pattern matched), but that is not the desired solution.
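For reference, the binary-column approach described above might look roughly like the following. This is only a sketch, not the asker's actual code: the sample data mirrors the example tables, and the names `text` and `flags` are illustrative.

```r
# Hypothetical sample data mirroring the tables in the question
keywords <- c("Keyword1", "Keyword2", "Keyword3")
df <- data.frame(
  Title    = c("Rstudio Keyword1", "Python Keyword2"),
  Abstract = c("A interesting program language keyword2",
               "A interesting program keyword3 language"),
  stringsAsFactors = FALSE
)

# Combine both text columns so each row is searched once
text <- paste(df$Title, df$Abstract)

# One logical column per keyword: TRUE where the row's text contains it
flags <- sapply(keywords, grepl, x = text, ignore.case = TRUE)
flags
```

Note that this is plain substring matching, so "Keyword1" would also flag a row that only contains "Keyword10". It also yields one column per keyword rather than the single comma-separated column the question asks for.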

Recommended answer

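In base R:

  • This handles punctuation, spaces, and line starts/ends.
  • Keywords can contain spaces and some punctuation (but not all).
  • Keywords in the new column keep the case of the original keyword vector.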
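To make the first point concrete, here is a small sketch of how the boundary pattern used in this answer behaves on a few strings. The keyword and the test strings are made up for illustration only.

```r
# Hypothetical keyword and test strings, chosen only to illustrate the pattern
keyword <- "Keyword1"

# The keyword must be preceded by start-of-string, a space, or punctuation,
# and followed by end-of-string, a space, or punctuation
pattern <- paste0("(^|[ [:punct:]])", tolower(keyword), "($|[ [:punct:]])")

texts <- c("Keyword1 at the start",      # keyword at the start of the string
           "ends with keyword1",         # keyword at the end of the string
           "in (keyword1), punctuated",  # keyword surrounded by punctuation
           "part of Keyword10")          # keyword is only a prefix of a longer token

result <- grepl(pattern, tolower(texts))
result
# [1]  TRUE  TRUE  TRUE FALSE
```

Lower-casing both the pattern and the text makes the match case-insensitive, while the reported keyword can still be taken from the original vector and so keeps its case.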

Code

# one lower-cased pattern per keyword, bounded by start/end, a space, or punctuation
patterns <- paste0('(^|[ [:punct:]])', tolower(keywords), '($|[ [:punct:]])')
text <- tolower(paste(df$Title, df$Abstract))
# lapply (rather than sapply) so the result is always a list of matching row indices
ind  <- lapply(patterns, grep, text)
ind[lengths(ind)==0] <- NA # for cases where no keyword is found
# long table of (keyword, row index) pairs, then one comma-separated string per row
ind2 <- do.call(rbind, Map(data.frame, keyword=keywords, i=ind))
ind3 <- aggregate(keyword ~ i, ind2, paste, collapse=', ')
df$keywords <- "" # rows without any keyword keep an empty string
df$keywords[ind3$i] <- ind3$keyword
#              Title                                Abstract           keywords
# 1 Rstudio Keyword1 A interesting program language keyword2 Keyword1, Keyword2
# 2  Python Keyword2 A interesting program keyword3 language Keyword2, Keyword3
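The keywords are pasted straight into a regular expression, which is why only some punctuation is supported: characters such as `+` or `.` are treated as regex operators. One possible adaptation, which is not part of the original answer, is to quote each keyword with PCRE's `\Q...\E` and pass `perl = TRUE`. The keywords and texts below are hypothetical, and the sketch assumes no keyword contains the literal sequence `\E`.

```r
# Hypothetical keywords containing regex metacharacters
keywords <- c("C++", "a.b")
texts <- c("I like C++, really", "axb only", "see a.b here")

# \Q...\E makes PCRE treat the keyword as literal text (requires perl = TRUE)
patterns <- paste0("(^|[ [:punct:]])\\Q", tolower(keywords),
                   "\\E($|[ [:punct:]])")

# One logical column per keyword, one row per text
hits <- vapply(patterns, grepl, logical(length(texts)),
               x = tolower(texts), perl = TRUE)
colnames(hits) <- keywords
hits
```

Here "a.b" matches only the text that literally contains "a.b"; without the quoting, the `.` would act as a wildcard and "axb" would match as well.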

Data

keywords <- c("Keyword1", "Keyword2", "Keyword3")

df <- read.table(text="Title                    Abstract
                 'Rstudio Keyword1'        'A interesting program language keyword2'
                 'Python Keyword2'         'A interesting program keyword3 language'",
                 header=TRUE, stringsAsFactors=FALSE)
