在R中连接不精确的字符串 [英] joining on inexact strings in R

查看:83
本文介绍了在R中连接不精确的字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要加入两个表..但是我要加入的数据不完全匹配..加入NFL球员的名字..

I am looking to join two tables.. however the data I am looking to join on does not match exactly.. joining on NFL player names..

data

> dput(att75a)
structure(list(rusher_player_name = c("A.Ekeler", "A.Jones", 
"A.Kamara", "A.Mattison", "A.Peterson", "B.Hill"), mean_epa = c(-0.110459963350783, 
0.0334332018597805, -0.119488111742492, -0.155261835310445, -0.123485646124451, 
-0.0689611296359916), success_rate = c(0.357664233576642, 0.40495867768595, 
0.401129943502825, 0.283018867924528, 0.322727272727273, 0.35
), plays = c(137L, 242L, 177L, 106L, 220L, 80L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

> dput(rb2019capa)
structure(list(rusher_player_name = c("Aaron Jones", "Adrian Peterson", 
"Alexander Mattison", "Alvin Kamara", "Austin Ekeler", "Brian Hill"
), Team = c("Packers", "Redskins", "Vikings", "Saints", "Chargers", 
"Falcons"), `Salary Cap Value` = c(695487, 1780000, 700545, 1050693, 
646668, 645000), `Cash Spent` = c(645000, 2530000, 1317180, 807500, 
645000, 645000)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

例如,我正尝试在Alexander Mattison上加入A.Mattison。等等。

for example I am trying to join A.Mattison on Alexander Mattison.. and so on..

我尝试了stringdist和Fuzzyjoin,但无法解决我的问题。

i experimented with stringdist and fuzzyjoin but could not solve my problem..

请考虑...采用每个数据集的head()来压缩每个提问准则。.原始数据集的长度为51 obs。和168个Obs ...会影响联接的执行方式吗?

please consider... took the head() of each dataset to condense per question asking guidelines.. original data sets have lengths of 51 obs. and 168 obs... will that affect how the join is performed?

清理这些名称的最佳方法是什么?

What is the best way to go about cleaning these names?

谢谢您的时间。.

推荐答案

使用 sub 将名字替换为名字的首字母。

Use sub to replace first name with first initial.

library(dplyr)

rb2019capa %>%
  mutate(rusher_player_name=
         sub("^([A-Z])\\S+\\s([A-Za-z].*)$", "\\1.\\2", rusher_player_name)) %>%
  inner_join(att75a, by="rusher_player_name") # or left_join (up to you)







# A tibble: 6 x 7
  rusher_player_name Team     `Salary Cap Value` `Cash Spent` mean_epa success_rate plays
  <chr>              <chr>                 <dbl>        <dbl>    <dbl>        <dbl> <int>
1 A.Jones            Packers              695487       645000   0.0334        0.405   242
2 A.Peterson         Redskins            1780000      2530000  -0.123         0.323   220
3 A.Mattison         Vikings              700545      1317180  -0.155         0.283   106
4 A.Kamara           Saints              1050693       807500  -0.119         0.401   177
5 A.Ekeler           Chargers             646668       645000  -0.110         0.358   137
6 B.Hill             Falcons              645000       645000  -0.0690        0.35     80

这篇关于在R中连接不精确的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆