Find matches of a vector of strings in another vector of strings


Problem description

I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases.

# Sample data frame of articles
articles <- data.frame(id=c(1, 2, 3, 4), text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod", "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,", "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo", "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"))
articles$text <- as.character(articles$text)

# Sample vector of keywords or phrases
keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))

#   id                                                                         text
# 1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# 2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
# 3  3      quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
# 4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse

Given the vector of keywords, the subset should contain rows 1, 2, and 4, since those rows contain one or more of the elements of the vector.

Neither %in% nor grep() works here. %in% compares whole elements, so it only matches when an entire text string equals a keyword (articles$text %in% keywords results in four FALSEs), and grep() doesn't seem to be able to handle vectorized patterns (grep(keywords, articles$text) gives an error). Neither function alone seems to work well across multiple dimensions (i.e. it would be easy to search for one word in all the rows, but not all 3 at the same time).

What's the best way to find and select all rows of the data frame that contain at least one of the elements of the keyword vector?

Solution

You can try pasting your keywords together, separated by the pipe character (|), which acts as an "or" in a regular expression, like this:

> articles[grepl(paste(keywords, collapse="|"), articles$text),]
  id                                                                         text
1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
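One caveat with the pasted pattern is that it is read as a regular expression, so a keyword containing characters such as . or ( would not be matched literally. The following is only a sketch of an alternative, not part of the original answer: it checks each keyword as a fixed string and combines the results with a logical "or". The names keep and matched are made up for this illustration.

```r
# Same sample data as in the question
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# One logical vector per keyword; fixed = TRUE treats the keyword literally
# instead of as a regular expression
per_keyword <- lapply(keywords, function(k) grepl(k, articles$text, fixed = TRUE))

# Element-wise "or" across all keywords: TRUE where at least one keyword matched
# (keep and matched are hypothetical names used only in this sketch)
keep <- Reduce(`|`, per_keyword)

matched <- articles[keep, ]
matched$id
```

On the sample data this selects the same rows as the pasted pattern (ids 1, 2 and 4). The pasted-regex version is shorter and is probably the better choice when the keywords are plain words; the fixed-string version may be safer when the keywords come from user input.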
