Finding all string matches from another dataframe in R


Question

I am relatively new to R.

I have a dataframe locs with a single variable V1 that looks like this:

V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal

and another dataframe cities with two variables, city and country, that looks like this:

city              country
edmonton          canada
san carlos        spain
los angeles       united states
santa maria       united states
tokyo             japan
madrid            spain
santa maria       portugal
lisbon            portugal

I want to create two new variables in locs, city and country, that hold every city name from cities found as a string match within V1, together with the corresponding country, so that locs looks like this:

V1                                            city                  country                      
edmonton general hospital                     edmonton              canada
hospital san carlos, madrid spain             san carlos, madrid    spain
hospital of santa maria, lisbon, portugal     santa maria, lisbon   portugal, united states

A few things to note: a V1 value may match several cities and therefore several countries. Also, if a country repeats (for instance, both san carlos and madrid are in spain), I only want one instance of that country.

Please advise.

Thanks.

Recommended answer

A solution using tidyverse and stringr. locs2 is the final output.

library(tidyverse)
library(stringr)

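locs2 <- locs %>%
  rowwise() %>%
  # One result per city pattern: the matched name, or NA when V1 does not contain it
  mutate(city = list(str_extract(V1, unique(cities$city)))) %>%
  ungroup() %>%
  unnest(city) %>%
  drop_na(city) %>%
  # Attach every country that has a city of this name
  left_join(cities, by = "city") %>%
  group_by(V1) %>%
  # Collapse each group into a sorted, de-duplicated, comma-separated string
  summarise(across(everything(), ~ toString(sort(unique(.x)))))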
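The matching step works because stringr functions are vectorised over the pattern as well as the string. The following is a minimal sketch of that idea for a single V1 value, using str_extract(); the names cities_vec, v1, hits and found are made up for this illustration and are not part of the answer.

```r
library(stringr)

# Hypothetical inputs for illustration only
cities_vec <- c("edmonton", "san carlos", "los angeles", "madrid")
v1 <- "cardiovascular institute, hospital san carlos, madrid spain"

# One result per pattern: the matched text, or NA where the city is absent
hits <- str_extract(v1, cities_vec)

# Keep only the cities that were actually found in v1
found <- hits[!is.na(hits)]
found
```

Note that this is plain substring matching, so a city name also matches when it occurs inside a longer word.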

Result

locs2 %>% as.data.frame()
                                                           V1                city                 country
1 cardiovascular institute, hospital san carlos, madrid spain  madrid, san carlos                   spain
2                                   edmonton general hospital            edmonton                  canada
3                   hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
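For comparison, roughly the same result can be obtained without tidyverse. This is only a sketch under assumptions, not part of the original answer: it uses fixed (non-regex) substring matching via grepl(), keeps the rows in their original order instead of sorting by V1, and the names locs_demo, cities_demo, collapse and hits are hypothetical.

```r
# Hypothetical copies of the example data, so the sketch is self-contained
locs_demo <- data.frame(
  V1 = c("edmonton general hospital",
         "cardiovascular institute, hospital san carlos, madrid spain",
         "hospital of santa maria, lisbon, portugal"),
  stringsAsFactors = FALSE
)
cities_demo <- data.frame(
  city    = c("edmonton", "san carlos", "los angeles", "santa maria",
              "tokyo", "madrid", "santa maria", "lisbon"),
  country = c("canada", "spain", "united states", "united states",
              "japan", "spain", "portugal", "portugal"),
  stringsAsFactors = FALSE
)

# Sorted, de-duplicated values joined with ", "
collapse <- function(x) toString(sort(unique(x)))

# For each V1 value, a logical vector marking the rows of cities_demo
# whose city name occurs in that value
hits <- lapply(locs_demo$V1, function(text) {
  vapply(cities_demo$city, function(p) grepl(p, text, fixed = TRUE), logical(1))
})

locs_demo$city    <- vapply(hits, function(h) collapse(cities_demo$city[h]), character(1))
locs_demo$country <- vapply(hits, function(h) collapse(cities_demo$country[h]), character(1))
locs_demo
```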

DATA

library(tidyverse)

locs <- tibble(V1 = c("edmonton general hospital",
                   "cardiovascular institute, hospital san carlos, madrid spain",
                   "hospital of santa maria, lisbon, portugal"))

cities <- read.table(text = "city              country
edmonton          canada
'san carlos'        spain
'los angeles'       'united states'
'santa maria'       'united states'
tokyo             japan
madrid            spain
'santa maria'       portugal
lisbon            portugal",
                     header = TRUE, stringsAsFactors = FALSE)
