Compare and refine strings in separate columns with refinr package

Question

A lot of my time is spent merging two data frames on a country, municipality, name or party column. This is where the refinr package, an R port of OpenRefine, comes in handy. However, I haven't figured out yet how to compare two columns that hold 'the same' values and harmonize their strings the way refinr does on a single vector. I'm not that experienced in R, so this may sound a little vague. Maybe my example makes things clearer.

library(tidyverse)
library(refinr)

# I would like to add the values (and the right names) of this example df...
df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
  )

# ...to the following df
df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
  )

# The following code of course produces NA values
df3 <- left_join(df1, df2, by = c("uid", "name"))

# While the following is the desired outcome

# A tibble: 4 x 3
  uid   name           value
  <chr> <chr>          <dbl>
1 A     Red             13 
2 B     Blue             3
3 C     yellow-purple  789   
4 C     green           17

key_collision_merge() and n_gram_merge() work on strings in a single vector. My question is: can I compare and change strings between two columns instead of one?

If this is possible, it would save me so much time!

Thanks in advance.

Recommended answer

I'm not sure this is the best use of refinr, which mostly serves to harmonize word spellings within a single column. What you want to do looks like a fuzzy join, and there is an R package for that: fuzzyjoin. An example of its use could be:

library(tidyverse)
library(fuzzyjoin)


df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
)

# ...to the following df
df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
)

df3 <- df2 %>%
  stringdist_left_join(df1,
                       distance_col = "dist", 
                       method='soundex') %>% 
  select(uid=uid.x, name=name.y, value)

df3
  # A tibble: 4 x 3
  uid   name          value
  <chr> <chr>         <dbl>
1 A     Red              13
2 B     Blue              3
3 C     yellow-purple   789
4 C     green            17

I used the soundex algorithm, but there are other methods, all based on the stringdist package.
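
As a rough sketch of how the choice of method changes the result, the distances between the two name columns can be inspected directly with stringdist before joining. The vectors `names1` and `names2` are illustrative names introduced here (their values are copied from the example data frames), and lower-casing both sides first is an assumption about how case differences should be treated, not part of the original answer.

```r
library(stringdist)

# Names from df1 and df2 that are expected to pair up (illustrative vectors)
names1 <- c("Red", "Blue", "yellow-purple", "green")
names2 <- c("red", "blu", "YellowPurple", "green")

# Lower-case both sides so that case differences do not count as edits
x <- tolower(names1)
y <- tolower(names2)

# Element-wise distances under three of the available methods
d_osa     <- stringdist(x, y, method = "osa")      # edit distance (a count of edits)
d_jw      <- stringdist(x, y, method = "jw")       # Jaro distance (p = 0 by default), 0 to 1
d_soundex <- stringdist(x, y, method = "soundex")  # 0 if the phonetic codes agree, else 1

# Side-by-side view to help choose a method and a threshold
data.frame(names1, names2, d_osa, d_jw, d_soundex)
```

Edit-distance methods need a threshold (the `max_dist` argument of `stringdist_left_join()`), so looking at the distances first can help pick a value that separates true matches from near misses.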
