How to extract all matching patterns (words in a string) in a dataframe column?

Question

I have two dataframes. One (txt.df) has a column with the text I want to extract phrases from (text). The other (wrd.df) has a column with the phrases (phrase). Both are big dataframes with complex texts and strings, but let's say:

txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
                     text = c("they love cats and dogs", "he is drinking juice", 
                              "the child is having a nap on the bed", "they jump on the bed and break it",
                              "the cat is sleeping on the bed"))


wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
                     phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed", "break the bed",
                              "sleeping on the bed"))

What I finally need is txt.df with an additional column that contains the labels of the phrases detected in each text.

What I tried was creating a column in wrd.df in which I tokenized the phrases, like this:

wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))

I then tried to write a custom function that applies grepl/str_detect over the token column and returns the names (labels) of the phrases whose tokens were all found:

Extract.Fun <- function(text, df, label, token){
  for (i in token) {
  truefalse[i] <- sapply(token[i], function (x) grepl(x, text))
  truenames[i] <- names(which(truefalse[i] == T))
  removedup[i] <- unique(truenames[i])
  return(removedup)
}

Finally, I sapply this custom function over txt.df$text to get a new column with the labels:

txt.df$extract <- sapply(txt.df$text, function (x) Extract.Fun(x, wrd.df, "label", "token"))

But I'm not good with custom functions and am really stuck. I would appreciate any help. P.S. It would be very good if I could also get partial matches like "drink juice" and "broke the bed", but that's not a priority; the exact phrases are fine.

Recommended answer

If you need to match the exact phrases, regex_join() from the fuzzyjoin package is what you need.

fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "phrase"), mode = "left" )

  id                                 text label              phrase
1  1              they love cats and dogs     a           love cats
2  2                 he is drinking juice  <NA>                <NA>
3  3 the child is having a nap on the bed     d      nap on the bed
4  4    they jump on the bed and break it  <NA>                <NA>
5  5       the cat is sleeping on the bed     d sleeping on the bed
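
The join above returns one row per match rather than the single label column asked for in the question. As a rough base-R sketch (not part of the original answer), the same exact-phrase test can be written with grepl() and collapsed into one column. The helper name exact_labels and the comma separator are assumptions made for illustration, and fixed = TRUE treats each phrase as a literal string rather than a regex.

```r
# Sample data copied from the question.
txt.df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  text = c("they love cats and dogs", "he is drinking juice",
           "the child is having a nap on the bed",
           "they jump on the bed and break it",
           "the cat is sleeping on the bed"),
  stringsAsFactors = FALSE
)
wrd.df <- data.frame(
  label = c("a", "b", "c", "d", "e", "d"),
  phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed",
             "break the bed", "sleeping on the bed"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: labels of every phrase that occurs verbatim in `text`.
# Returns "" when nothing matches.
exact_labels <- function(text, phrases, labels) {
  hit <- vapply(phrases, function(p) grepl(p, text, fixed = TRUE), logical(1))
  paste(unique(labels[hit]), collapse = ",")
}

# One collapsed label string per row of txt.df.
txt.df$extract <- vapply(txt.df$text, exact_labels, character(1),
                         phrases = wrd.df$phrase, labels = wrd.df$label,
                         USE.NAMES = FALSE)
print(txt.df[, c("id", "extract")])
```

Rows without a match get an empty string here instead of NA; replacing "" with NA afterwards is straightforward if the NA convention of the join is preferred.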

If you want to match all of the words of a phrase, in any order, I guess you can build a regex out of the phrases that covers such behaviour, using one lookahead per word:

#build regex for phrases
#done by splitting the phrases to individual words, and then paste the regex together
wrd.df$regex <- unlist( lapply( lapply( strsplit( wrd.df$phrase, " "), 
                                        function(x) paste0( "(?=.*", x, ")", collapse = "" ) ),
                                function(x) paste0( "^", x, ".*$") ) )


fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "regex"), mode = "left" )

  id                                 text label              phrase                                        regex
1  1              they love cats and dogs     a           love cats                     ^(?=.*love)(?=.*cats).*$
2  1              they love cats and dogs     b           love dogs                     ^(?=.*love)(?=.*dogs).*$
3  2                 he is drinking juice     c      juice drinking                ^(?=.*juice)(?=.*drinking).*$
4  3 the child is having a nap on the bed     d      nap on the bed      ^(?=.*nap)(?=.*on)(?=.*the)(?=.*bed).*$
5  4    they jump on the bed and break it     e       break the bed            ^(?=.*break)(?=.*the)(?=.*bed).*$
6  5       the cat is sleeping on the bed     d sleeping on the bed ^(?=.*sleeping)(?=.*on)(?=.*the)(?=.*bed).*$
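
For comparison, the tokenize-then-grepl idea attempted in the question can be sketched in base R without lookaheads. This is a minimal sketch, not the answer's method: the helper name all_word_labels and the comma separator are assumptions. Each phrase is split into words, and a phrase counts as detected when every word occurs somewhere in the text.

```r
# Sample data copied from the question.
txt.df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  text = c("they love cats and dogs", "he is drinking juice",
           "the child is having a nap on the bed",
           "they jump on the bed and break it",
           "the cat is sleeping on the bed"),
  stringsAsFactors = FALSE
)
wrd.df <- data.frame(
  label = c("a", "b", "c", "d", "e", "d"),
  phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed",
             "break the bed", "sleeping on the bed"),
  stringsAsFactors = FALSE
)

# strsplit() always returns a list, one character vector of words per phrase.
tokens <- strsplit(wrd.df$phrase, " ", fixed = TRUE)

# Hypothetical helper: labels of the phrases whose words ALL occur in `text`.
all_word_labels <- function(text, tokens, labels) {
  hit <- vapply(tokens, function(words) {
    all(vapply(words, function(w) grepl(w, text, fixed = TRUE), logical(1)))
  }, logical(1))
  paste(unique(labels[hit]), collapse = ",")
}

# One collapsed label string per row of txt.df.
txt.df$extract <- vapply(txt.df$text, all_word_labels, character(1),
                         tokens = tokens, labels = wrd.df$label,
                         USE.NAMES = FALSE)
print(txt.df[, c("id", "extract")])
```

Like the lookahead regex, this matches substrings rather than whole words, so a phrase word "drink" would also be found in "drinking", while "the" is also found inside "they". If whole-word matching matters, word boundaries would have to be added to the patterns.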
